$STW-$ $7089CME$: Modeling $CCPP$ Power Plant Energy Output Using Nonlinear Regression:¶

Exploring Data:

The power plant is running at full capacity for six years (2006-2011).

The provided dataset contains 9,568 hourly measurements of how much electricity the plant produced. The dataset contains four independent environmental variables and one dependent variable.

Four key environmental factors that affect the plant's performance are:

  1. Temperature (T): The outside air temperature in °C
  2. Pressure (AP): The air pressure in millibars
  3. Humidity (RH): The moisture level in the air as a percentage
  4. Exhaust Vacuum (V): The vacuum pressure in the steam turbine (measured in cm Hg)

The main thing we want to predict is:

  1. Electrical Output (EP): How much power the plant generates each hour (in MW)

The power plant uses a two-step system to make electricity efficiently:

  1. Gas Turbine (GT): Burns fuel to generate electricity directly.

  2. Steam Turbine (ST): Uses leftover heat from the gas turbine.

By using the waste heat from the gas turbine to power the steam turbine, the plant gets more electricity from the same amount of fuel - making it more cost-effective and environmentally friendly than single-system plants.

Here, we are requeated to identify better mathematical model (from given 5 models) that can accurately predict a power plant’s electricity output based on given real-world conditions like temperature, humidity and others.

For this, first, import some essential libraries and install them if they haven't been already.

In [1]:
# Install and import packages: 

# Define all required packages: 
pkg_groups <- list(
  data_tools = list(
    packages = c("readr", "dplyr", "tidyr", "purrr", "reshape2", "zoo"),
    purpose = "Data I/O and manipulation"
    # readr: Fast CSV/TSV file reading and parsing  
    # dplyr: Core data manipulation (filter, mutate, summarize)  
    # tidyr: Data tidying (pivoting, nesting, handling missing values)  
    # purrr: Functional programming tools (map, reduce, list operations)  
    # reshape2: Convert data between wide/long formats (melt/dcast)  
    # zoo: Time-series operations (rolling windows, lag calculations) 
), 

  visualization = list(
    packages = c("ggplot2", "GGally", "viridis", "patchwork", "gridExtra", "corrplot"),
    purpose = "Data visualization"
    # ggplot2: Create elegant and customizable data visualizations  
    # GGally: Extended ggplot2 functionality (scatterplot matrices, etc.)  
    # viridis: Colorblind-friendly color scales for plots  
    # patchwork: Combine and arrange multiple ggplot2 plots  
    # gridExtra: Layout arrangements for grid-based graphics  
    # corrplot: Visualize correlation matrices with color coding 
), 

  stats = list(
    packages = c("car", "nortest", "moments", "MASS", "outliers"),
    purpose = "Statistical analysis"
    # car: Regression diagnostics and hypothesis testing  
    # nortest: Normality tests (Anderson-Darling, Shapiro-Wilk etc.)  
    # moments: Skewness, kurtosis and related moment calculations  
    # MASS: Modern applied statistics functions and datasets  
    # outliers: Detect and handle statistical outliers 
  ),

  utilities = list(
    packages = c("stringr", "knitr"),
    purpose = "Utility functions"
    # stringr: Consistent string manipulation and pattern matching  
    #knitr: Dynamic report generation and reproducible research  
  )
)

# install if not installed already and then load them: 
suppressPackageStartupMessages({
  suppressWarnings({
    for (group in pkg_groups) {
      message("[Loading] ", group$purpose, " packages...")
      for (pkg in group$packages) {
        if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
          install.packages(pkg, quiet = TRUE, repos = "https://cloud.r-project.org")
          library(pkg, character.only = TRUE, quietly = TRUE)
        }
      }
    }
  })
})

# Verify that the packages are loaded successfully 
loaded <- search()
message("All packages loaded successfully!")
[Loading] Data I/O and manipulation packages...

[Loading] Data visualization packages...

[Loading] Statistical analysis packages...

[Loading] Utility functions packages...

All packages loaded successfully!

In [2]:
# Setting default values like themes, colors, fonts and size etc for each plot:

# Set default plot size using repr
# width and height are in inches
options(jupyter.plot_mimetypes = "image/png") 
options(repr.plot.width = 20,   # Set default plot width
        repr.plot.height = 8)  # Set default plot height

# Define a custom ggplot2 theme for all plots:
my_custom_theme <- theme_minimal(base_size = 12) +   # Base theme with larger font
                   theme(
                     plot.title = element_text(face = "bold", hjust = 0.5),
                     axis.title = element_text(face = "bold"),
                     legend.position = "bottom"
                   )

# Set this theme as default for all plots
theme_set(my_custom_theme)

# Set default color scale or palette
scale_fill_continuous <- scale_fill_viridis_c   # use Viridis for continuous fill
scale_color_discrete <- scale_color_brewer(palette = "Set2")  # For discrete color scale

Import dataset using read_csv from readr package:

In [3]:
# Read the dataset using readr package:
data <- read_csv('dataset.csv', show_col_types = FALSE)
In [4]:
# Check the shape: rows and columns of the dataset:
nrow(data) 
ncol(data)
9568
5

The dataset contains 5 columns and more than 9500 rows (samples).

In [5]:
# Display the first few rows of the dataset
head(data)
A tibble: 6 × 5
x1x3x4x5x2
<dbl><dbl><dbl><dbl><dbl>
8.3440.771010.8490.01480.48
23.6458.491011.4074.20445.75
29.7456.901007.1541.91438.76
19.0749.691007.2276.79453.09
11.8040.661017.1397.20464.43
13.9739.161016.0584.60470.96

Here, the dataframe contains four independent variables $x_1$, $x_3$, $x_4$ and $x_5$. $x_2$ is the dependent variable:

  • x1: Ambient Temperature (°C) - it reperesents air temperature

  • x3: Ambient Pressure (mbar) - it reperesents atmospheric pressure

  • x4: Exhaust Vacuum (cm Hg) - it represents vacuum in steam turbine in cm Hg

  • x5: Relative Humidity (%) - it represents ambient humidity

  • x2: Net Energy Output (MW) - Hourly electrical energy generated in MW

In [6]:
# Check the dimensions of the dataset
dim(data)
  1. 9568
  2. 5

To get a better idea about the dataset and its distribution of values, lets calculate some important descriptive statistics:

In [7]:
# Generate a descriptive statistics table for the dataset:

# identify numeric columns
numeric_cols <- sapply(data, is.numeric)
numeric_data <- data[, numeric_cols]

# Calculate statistics for each column
stats_table <- data.frame(
  variable = names(numeric_data),
  mean = sapply(numeric_data, mean, na.rm = TRUE),
  median = sapply(numeric_data, median, na.rm = TRUE),
  min = sapply(numeric_data, min, na.rm = TRUE),
  max = sapply(numeric_data, max, na.rm = TRUE),
  sd = sapply(numeric_data, sd, na.rm = TRUE),
  q25 = sapply(numeric_data, function(x) quantile(x, 0.25, na.rm = TRUE)),
  q75 = sapply(numeric_data, function(x) quantile(x, 0.75, na.rm = TRUE)),
  skew = sapply(numeric_data, skewness, na.rm = TRUE),
  kurt = sapply(numeric_data, kurtosis, na.rm = TRUE)
)

# use knitr::kable to format the outout nicely: 
kable(stats_table, digits = 3, 
      caption = "Descriptive Statistics for Numeric Variables:")

Table: Descriptive Statistics for Numeric Variables:

|   |variable |     mean|   median|    min|     max|     sd|      q25|     q75|   skew|  kurt|
|:--|:--------|--------:|--------:|------:|-------:|------:|--------:|-------:|------:|-----:|
|x1 |x1       |   19.651|   20.345|   1.81|   37.11|  7.452|   13.510|   25.72| -0.136| 1.962|
|x3 |x3       |   54.306|   52.080|  25.36|   81.56| 12.708|   41.740|   66.54|  0.198| 1.556|
|x4 |x4       | 1013.259| 1012.940| 992.89| 1033.30|  5.939| 1009.100| 1017.26|  0.265| 3.094|
|x5 |x5       |   73.309|   74.975|  25.56|  100.16| 14.600|   63.328|   84.83| -0.432| 2.555|
|x2 |x2       |  454.365|  451.550| 420.26|  495.76| 17.067|  439.750|  468.43|  0.306| 1.951|
  • x1 (Ambient Temperature): Temperatures are evenly distributed around 20°C with moderate variability and minimal skew.

  • x3 (Ambient Pressure): Atmospheric pressure shows a slight right skew with an average around 54 mbar, it is mostly normal with a few higher readings.

  • x4 (Exhaust Vacuum): Exhaust Vacuum values are tightly sppread around 1013 with very little variation, suggesting stable conditions.

  • x5 (Relative Humidity): Relative Humidity values vary widely and is slightly left skewed, indicating some lower extreme valuues which cause the average down.

  • x2 (Net Energy Output): Energy output is consistently centered around 454 MW with a slight right skew which indicates a stable generation pattern with occasional higher outputs.

In [8]:
# Check data types of columns:
sapply(data, class)
x1
'numeric'
x3
'numeric'
x4
'numeric'
x5
'numeric'
x2
'numeric'

Now, check if the dataset contains any missing values or not:

In [9]:
# Check for missing values
sum(is.na(data))
0

So, no missing values detected.

Also, check if there any duplicating rows in the data:

In [10]:
# Check for duplicates
sum(duplicated(data))
41

So, there are 41 rows that are found to be duplicating ie. there are other 41 rows having similar data:

In [11]:
# View all duplicated rows (including their first occurrence)
dup_rows <- data[duplicated(data) | duplicated(data, fromLast = TRUE), ]
print(dup_rows)
# A tibble: 82 × 5
      x1    x3    x4    x5    x2
   <dbl> <dbl> <dbl> <dbl> <dbl>
 1 25.9   75.6 1018.  62.6  443.
 2 29.4   75.6 1018.  50.7  437.
 3 29.5   75.6 1018.  50.6  431.
 4 25.2   75.6 1017.  69.8  446.
 5  6.86  41.2 1020.  79.1  490.
 6 28.4   75.6 1017.  55.2  439.
 7 12.3   41.2 1019.  52.2  474.
 8 24.8   75.6 1017.  68.2  440.
 9  8.44  41.2 1020.  70.7  486.
10 24.6   59.6 1010.  85.4  441.
# ℹ 72 more rows
In [12]:
# Extract all duplicated rows including their first occurrences
dup_rows <- data[duplicated(data) | duplicated(data, fromLast = TRUE), ]

# Sort duplicate rows to group duplicates together 
dup_rows_sorted <- dup_rows[do.call(order, dup_rows), ]

print(dup_rows_sorted)
# A tibble: 82 × 5
      x1    x3    x4    x5    x2
   <dbl> <dbl> <dbl> <dbl> <dbl>
 1  6.03  41.2 1020.  84.2  489.
 2  6.03  41.2 1020.  84.2  489.
 3  6.06  41.2 1020.  84.7  490.
 4  6.06  41.2 1020.  84.7  490.
 5  6.4   41.2 1020.  85.4  489.
 6  6.4   41.2 1020.  85.4  489.
 7  6.86  41.2 1020.  79.1  490.
 8  6.86  41.2 1020.  79.1  490.
 9  7.24  41.2 1020.  80.6  487.
10  7.24  41.2 1020.  80.6  487.
# ℹ 72 more rows

This makes comparision of duplicates easy. Every consecutive pair of rows contains the same data.

In [13]:
# Count number of rows before dropping duplicates:
nrow(data)
9568

Now, clean the dataset by deleting duplicated samples:

In [14]:
# create a variable which contains only duplicated rows and their first occurrence:
duplicates_only <- duplicated(data)

# Drop only those duplicated rows
data <- data[!duplicates_only, ]
In [15]:
# Check the number of rows after removing duplicates
nrow(data)
9527

Now, add a new column naming time which acts as the time series data for our dataframe data.

In [16]:
# Add time column at the beginning of the dataset which acts as the time stamps
# The data assumbed to be recorded at an equal time interval of 1 units (1 hour)
data$time <- 1:nrow(data)


# Reorder the columns
data <- data[, c("time", "x1", "x3", "x4", "x5", "x2")] 
# x2 is the dependent variable (sometimes denoted as y) 
#and move x2 to the end of the dataset as per the general convention in machine learning 
In [17]:
# Check if the new column was successfully created or not
colnames(data)  # Shows all column names in the dataframe
  1. 'time'
  2. 'x1'
  3. 'x3'
  4. 'x4'
  5. 'x5'
  6. 'x2'
In [18]:
# check the data types of columns:
sapply(data, class)
time
'integer'
x1
'numeric'
x3
'numeric'
x4
'numeric'
x5
'numeric'
x2
'numeric'

All set.

Task-1: Preliminary Data Analysis:¶

Task-1.1: Time Series Analysis:¶

Following time-series-plot shows how each valriable (temperature, pressure, humidity, vacuum and energy output) changes over time.

In [19]:
# supress warnings temporarily:
options(warn = -1)

# Function to create individual time series plots:
create_ts_plot <- function(data, var, color, title, y_label) {
  ggplot(data, aes_string(x = "time", y = var)) +
    geom_line(color = color, linewidth = 1.2) +  
    labs(title = title,
         y = y_label,
         x = NULL) +
    theme_minimal(base_size = 14) + 
    theme(
      plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
      panel.grid.minor = element_blank(),
      axis.text = element_text(size = 12),
      axis.title = element_text(size = 14)
    )
}

# Create individual plots for each variable 
p1 <- create_ts_plot(data, "x1", "#1f77b4", "Ambient Temperature (x1)", "Temperature (°C)")
p2 <- create_ts_plot(data, "x2", "#ff7f0e", "Net Hourly Electrical Energy Output (x2) (Dependent Variable)", "Energy Output (MW)")
p3 <- create_ts_plot(data, "x3", "#2ca02c", "Ambient Pressure (x3)", "Pressure (Millibar)")
p4 <- create_ts_plot(data, "x4", "#d62728","Exhaust Vacuum (x5)", "Vacuum (cm Hg)") 
p5 <- create_ts_plot(data, "x5", "#9467bd", "Relative Humidity (x4)", "Humidity (%)")

# Add x-axis label only to the bottom plot
p5 <- p5 + labs(x = "Time (Units)") + 
  theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)))

# Combine all plots with patchwork
combined_plot <- p1 / p3 / p4 / p5 / p2 +
  plot_annotation(
    title = "Time Series of CCPP Variables",
    theme = theme(plot.title = element_text(size = 22, hjust = 0.5, face = "bold"))
  )

# set the dimensions for better viewing experience 
options(repr.plot.width = 20, repr.plot.height = 20)  # For Jupyter/IRkernel

# Display the combined plot
combined_plot

# Reset warnings back to default
options(warn = 0)
No description has been provided for this image

This displays five separate time series plots stacked vertically, each representing a different variable from a Combined Cycle Power Plant (CCPP) dataset. Along the x-axis we plot time scales (Hours) and along the y-axis, we show measured values for a specific variables such as temperature, energy output, humidity, pressure, or vacuum.

Each colored line (line plot) represents how a specific variable changes across the timeframe. The chart gives an overview of how each measurement fluctuates over the entire period.

Not much inference can be made from these time series plots because 6 years (2006-2011) of data is being displayed along the axes and high frequency of observations create a dense, noisy appearence.

To solve this problem and spot longer-term trends, let's defines a rolling window of 100 observations. This means, every smoothed value is the average of the surrounding 100 raw measurements.

In [20]:
head(data)
A tibble: 6 × 6
timex1x3x4x5x2
<int><dbl><dbl><dbl><dbl><dbl>
1 8.3440.771010.8490.01480.48
223.6458.491011.4074.20445.75
329.7456.901007.1541.91438.76
419.0749.691007.2276.79453.09
511.8040.661017.1397.20464.43
613.9739.161016.0584.60470.96
In [21]:
# supress warnings:
options(warn= -1)

# Define window size for the rolling average
window_size <- 100

# Apply rolling window to each variable
data_smoothed <- data %>%
  mutate(
    x1_smooth = rollmean(x1, k = window_size, fill = NA, align = "center"),
    x3_smooth = rollmean(x3, k = window_size, fill = NA, align = "center"),
    x4_smooth = rollmean(x4, k = window_size, fill = NA, align = "center"),
    x5_smooth = rollmean(x5, k = window_size, fill = NA, align = "center"),
    x2_smooth = rollmean(x2, k = window_size, fill = NA, align = "center")
  )

# Function to create plots with both raw and smoothed data
create_smoothed_plot <- function(data, raw_var, smooth_var, color, title, y_label) {
  ggplot(data, aes(x = time)) +
    # Raw data in background with transparency
    geom_line(aes(y = .data[[raw_var]]), color = color, alpha = 0.2, linewidth = 0.5) +
    # Smoothed data in foreground
    geom_line(aes(y = .data[[smooth_var]]), color = color, linewidth = 1.5) +
    labs(title = title,
         y = y_label,
         x = NULL) +
    theme_minimal(base_size = 14) +
    theme(
      plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
      panel.grid.minor = element_blank(),
      axis.text = element_text(size = 13),
      axis.title = element_text(size = 14),
      plot.margin = margin(t = 5, r = 10, b = 5, l = 10)
    )
}

# Create individual plots for each variable
p1 <- create_smoothed_plot(data_smoothed, "x1", "x1_smooth", "#1f77b4", 
                          "Ambient Temperature (x1)", "Temperature (°C)")

p3 <- create_smoothed_plot(data_smoothed, "x3", "x3_smooth", "#ff7f0e", 
                          "Ambient Pressure (x3)", "Pressure (millibar)")

p5 <- create_smoothed_plot(data_smoothed, "x5", "x5_smooth", "#2ca02c", 
                          "Relative Humidity (x5)", "Humidity (%)")

p4 <- create_smoothed_plot(data_smoothed, "x4", "x4_smooth", "#d62728", 
                          "Exhaust Vacuum (x4)", "Vacuum (cm Hg)")

p2 <- create_smoothed_plot(data_smoothed, "x2", "x2_smooth", "#9467bd", 
                          "Net Hourly Electrical Energy Output (x2) (Dependent Variable)", "Energy Output (MW)")

# Add x-axis label only to the bottom plot
p5 <- p5 + labs(x = "Time (hours)")

# Combine all plots with patchwork
combined_smoothed_plot <- p1 / p3 / p4 / p5 / p2 +
  plot_annotation(
    title = "Time Series of CCPP Variables (100-point Rolling Average):",
    subtitle = "Light lines show raw data; dark lines show smoothed trend",
    theme = theme(
      plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
      plot.subtitle = element_text(size = 14, hjust = 0.5)
    )
  )

# Display the combined plot
options(repr.plot.width = 20, repr.plot.height = 20)
combined_smoothed_plot

# Reset warnings back to default 
options(warn=0)
No description has been provided for this image

This plot shows the changing patterns of five different measurements from a power plant over time. Each plot has two lines: a faint, wiggly line showing the original ups and downs, and a darker, smoother line that reveals the general trend by averaging out 100 points at a time.

The smoothed rolling average line plot provide a general sense of how the variables behave over time. But still dense raw data adds noise and it makes interpertation very difficult.

To get a simpified high-level view of the time series data, I divide the dataset into segments where each segment consists of 200 consecutive observations and then calculate the average of each segment and plot these averages over time.

In [22]:
# Create a high-level view by segmenting the data
# Each segment consists of 200 consecutive observations

# supress warnings for this code block:
options(warn = -1)

# Define segment size:
segment_size <- 200 # Number of observations per segment
  
# Create a segment identifier column
data$segment <- ceiling(data$time / segment_size)
  
# Calculate average for each segment and variable
segment_averages <- data %>%
group_by(segment) %>%
summarize(
    time = mean(time),  # Use the middle time point of each segment
    x1_avg = mean(x1),
    x2_avg = mean(x2),
    x3_avg = mean(x3),
    x4_avg = mean(x4),
    x5_avg = mean(x5)
)
  
# Function to create segment average plots
create_segment_plot <- function(data, var, color, title, y_label) {
ggplot(data, aes(x = time, y = .data[[var]])) +
    # Add points for each segment average
    geom_point(color = color, size = 3) +
    # Connect points with lines
    geom_line(color = color, linewidth = 1) +
    # Add labels
    labs(title = title,
        y = y_label,
        x = NULL) +
      theme_minimal(base_size = 14) +
      theme(
        plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        panel.grid.minor = element_blank(),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 14),
        plot.margin = margin(t = 5, r = 10, b = 5, l = 10)
      )
  }
  
# Create individual plots for each variable
p1 <- create_segment_plot(segment_averages, "x1_avg", "#1f77b4", 
                           "Ambient Temperature (x1)", "Temperature (°C)")
p2 <- create_segment_plot(segment_averages, "x2_avg", "#ff7f0e", 
                           "Net Hourly Electrical Energy Output (x2)", "Energy Output (MW)")
p3 <- create_segment_plot(segment_averages, "x3_avg", "#2ca02c", 
                           "Ambient Pressure (x3)", "Pressure (milllibar)")
p5 <- create_segment_plot(segment_averages, "x5_avg", "#d62728", 
                           "Relative Humidity (x5)", "Humidity (%)")
p4 <- create_segment_plot(segment_averages, "x4_avg", "#9467bd", 
                           "Exhaust Vacuum (x4)", "Vacuum (cm Hg)")
  
# Add x-axis label only to the bottom plot
p5 <- p5 + labs(x = "Time (hours)")
  
# Combine all plots with patchwork
combined_segment_plot <- p1 / p3 / p4 / p5/ p2
    plot_annotation(
      title = "Time Series of CCPP Variables (200-point Segments)",
      subtitle = "Each point represents the average of 200 consecutive observations",
      theme = theme(
        plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(size = 14, hjust = 0.5)
      )
    )
  
# Display the combined plot
options(repr.plot.width = 20, repr.plot.height = 20)
print(combined_segment_plot)

# Reset warnings back to default
options(warn = 0)
$title
[1] "Time Series of CCPP Variables (200-point Segments)"

$subtitle
[1] "Each point represents the average of 200 consecutive observations"

$caption
list()
attr(,"class")
[1] "waiver"

$tag_levels
list()
attr(,"class")
[1] "waiver"

$tag_prefix
list()
attr(,"class")
[1] "waiver"

$tag_suffix
list()
attr(,"class")
[1] "waiver"

$tag_sep
list()
attr(,"class")
[1] "waiver"

$theme
List of 2
 $ plot.title   :List of 11
  ..$ family       : NULL
  ..$ face         : chr "bold"
  ..$ colour       : NULL
  ..$ size         : num 20
  ..$ hjust        : num 0.5
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi FALSE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ plot.subtitle:List of 11
  ..$ family       : NULL
  ..$ face         : NULL
  ..$ colour       : NULL
  ..$ size         : num 14
  ..$ hjust        : num 0.5
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi FALSE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 - attr(*, "class")= chr [1:2] "theme" "gg"
 - attr(*, "complete")= logi FALSE
 - attr(*, "validate")= logi TRUE

attr(,"class")
[1] "plot_annotation"
No description has been provided for this image

Ambient Temperature (x1) shows clear wave-like ups and downs, likely due to seasons. It changes the most over time.

Ambient Pressure (x3) stays mostly flat with small changes. Pressure is stable and doesn’t vary much.

Exhaust Vacuum (x4) fluctuates more sharply and often. Likely affected by how the plant is running or being maintained.

Relative Humidity (x5) changes moderately with some seasonal influence. Less dramatic than temperature or vacuum.

Net Hourly Electrical Energy Output (x2) fairly steady over time. Despite changes in other factors, the plant keeps output consistent.

Task 1.2: Histograms with Density Plots:¶

A histogram is a graphical tool used to understand the distribution of numerical variables. It organizes data into continuous intervals, or “bins,” and displays the frequency of values within each range using vertical bars. The height of each bar represents how many data points fall within that interval, making it easier to identify patterns such as skewness, concentration, or spread in the dataset. Histograms are particularly effective for revealing the underlying structure of the data, including central tendency, variability, and the presence of outliers.

A density plot is a smooth, continuous curve that estimates the probability distribution of a numerical variable. Unlike histograms that group data into bins, density plots use a technique called kernel density estimation (KDE) to create a fluid shape that reflects how data points are distributed across the range of values. This approach helps highlight features like peaks (modes), skewness, and the general shape of the data distribution. It’s particularly useful when comparing multiple distributions or when a clearer picture of data concentration is needed without the rigid binning structure of a histogram.

When combined with a histogram in a single plot, the density curve provides a more detailed and intuitive understanding of the data’s distribution. The histogram offers a straightforward count-based view of how frequently data values occur within certain ranges, while the density plot overlays a smoothed representation of the same distribution. This combination allows to quickly see both the exact frequencies and the overall shape of the distribution, making it easier to detect subtle patterns, assess normality, or identify multiple peaks that might indicate distinct subgroups within the data.

In [23]:
head(data)
A tibble: 6 × 7
timex1x3x4x5x2segment
<int><dbl><dbl><dbl><dbl><dbl><dbl>
1 8.3440.771010.8490.01480.481
223.6458.491011.4074.20445.751
329.7456.901007.1541.91438.761
419.0749.691007.2276.79453.091
511.8040.661017.1397.20464.431
613.9739.161016.0584.60470.961
In [24]:
# Create histogram plots with density curves for independent variables: x1, x3, x4, x5 

# Suppress warnings
options(warn = -1)

# Function to calculate mode
calculate_mode <- function(x) {
  # Create frequency table
  ux <- unique(x)
  # Find the value with highest frequency
  ux[which.max(tabulate(match(x, ux)))]
}

# Function to create histogram with density plot and statistics
create_hist_density_plot <- function(data, var, color, fill, title, x_label) {
  # Calculate statistics
  mean_val <- mean(data[[var]])
  median_val <- median(data[[var]])
  mode_val <- calculate_mode(round(data[[var]], 1))  # Round to 1 decimal for better mode calculation
  sd_val <- sd(data[[var]])
  skew_val <- moments::skewness(data[[var]])
  kurt_val <- moments::kurtosis(data[[var]]) - 3  # Excess kurtosis
  
  # Create the plot
  p <- ggplot(data, aes_string(x = var)) +
    # Add histogram
    geom_histogram(aes(y = after_stat(count)), 
                  bins = 30, 
                  fill = fill, 
                  color = "white", 
                  alpha = 0.7) +
    # Add density curve
    geom_density(aes(y = after_stat(count)), 
                color = color, 
                linewidth = 1) +
    # Add vertical lines for mean, median, and mode
    geom_vline(xintercept = mean_val, color = "red", linetype = "dashed", linewidth = 1) +
    geom_vline(xintercept = median_val, color = "green", linetype = "dashed", linewidth = 1) +
    geom_vline(xintercept = mode_val, color = "blue", linetype = "dotted", linewidth = 1) +
    # Add labels
    labs(title = title,
         x = x_label,
         y = "Frequency") +
    # Add statistics as text box
    annotate("text", 
            x = min(data[[var]]) - 0.05 * (max(data[[var]]) - min(data[[var]])), 
            y = 0.9 * max(ggplot_build(ggplot(data, aes_string(x = var)) + 
                                      geom_histogram(bins = 30))$data[[1]]$count),
            label = sprintf("Mean: %.2f\nMedian: %.2f\nStd Dev: %.2f\nSkewness: %.2f\nKurtosis: %.2f", 
                           mean_val, median_val, sd_val, skew_val, kurt_val),
            hjust = 0, 
            vjust = 1,
            size = 5,
            color = "black",
            fontface = "plain",
            family = "sans") +
    # Add theme
    theme_minimal(base_size = 14) +
    theme(
      plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
      panel.grid.minor = element_blank(),
      axis.text = element_text(size = 12),
      axis.title = element_text(size = 14)
    )
  
  return(p)
}

# Create plots for independent variables
p1 <- create_hist_density_plot(data, "x1", "#1f77b4", "#1f77b4", 
                              " (x1) Ambient Temperature (°C)", "Ambient Temperature (°C)")
p5 <- create_hist_density_plot(data, "x5", "#2ca02c", "#2ca02c", 
                              " (x5) Relative Humidity (%)", "Relative Humidity (%)")
p3 <- create_hist_density_plot(data, "x3", "#d62728", "#d62728", 
                              " (x3) Ambient Pressure (millibar)", "Ambient Pressure (millibar)")
p4 <- create_hist_density_plot(data, "x4", "#9467bd", "#9467bd", 
                              " (x4) Exhaust Vacuum (cm Hg)", "Exhaust Vacuum (cm Hg)")

# Arrange plots in a 2x2 grid
combined_hist_plot <- (p1 + p3) / (p4 + p5) +
  plot_annotation(
    title = "Distribution of Independent Variables",
    theme = theme(
      plot.title = element_text(size = 20, hjust = 0.5, face = "bold")
    )
  )

# Display the combined plot
options(repr.plot.width = 20, repr.plot.height = 8)
print(combined_hist_plot)

# Reset warnings
options(warn = 0)
No description has been provided for this image
  • Ambient Temperature (°C): The plot shows a bell-shaped, nearly symmetric distribution centered around 20°C, with most values between 10°C and 30°C and a slight tail on the cooler side.

  • Ambient Pressure (mbar): The distribution is tightly grouped around 54.29 mbar, nearly symmetric, and shows little variation, suggesting stable atmospheric pressure.

  • Exhaust Vacuum (cm Hg): The plot is centered near 1013 cm Hg with a shape close to normal and very little spread, indicating consistent exhaust vacuum levels.

  • Relative Humidity (%): The distribution is moderately left-skewed with values mainly between 60% and 90%, showing a wider range of humidity levels.

In [25]:
# Create histogran for dependent variable: x2 ie. Energy Output (MW): 

# Suppress warnings
options(warn = -1)

# Function to calculate mode: 
calculate_mode <- function(x) {
  # Create frequency table
  ux <- unique(x)
  # Find the value with highest frequency
  ux[which.max(tabulate(match(x, ux)))]
}

# Calculate statistics for x2
mean_val <- mean(data$x2)
median_val <- median(data$x2)
mode_val <- calculate_mode(round(data$x2, 1))
sd_val <- sd(data$x2)
skew_val <- skewness(data$x2)
kurt_val <- kurtosis(data$x2) - 3

# Get histogram data for scaling
hist_data <- hist(data$x2, plot = FALSE, breaks = 30)
max_count <- max(hist_data$counts)
bin_width <- hist_data$breaks[2] - hist_data$breaks[1]

# Create the plot
p_dep <- ggplot(data, aes(x = x2)) +
  geom_histogram(aes(y = after_stat(count)), bins = 30,
                 fill = "#ff7f0e", color = "white", alpha = 0.6) +
  
  geom_density(aes(y = after_stat(count), color = "Actual Density"), linewidth = 1) +
  
  stat_function(aes(color = "Normal Distribution"), fun = function(x) {
    scaling_factor <- nrow(data) * bin_width
    dnorm(x, mean = mean_val, sd = sd_val) * scaling_factor
  }, linewidth = 1, linetype = "dashed") +
  
  # Proper mappings to include in legend: 
  geom_vline(aes(xintercept = mean_val, color = "Mean"), linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = median_val, color = "Median"), linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = mode_val, color = "Mode"), linetype = "dotted", linewidth = 1.5) +
  
  labs(
    title = "Distribution of Dependent Variable:",
    subtitle = "Net Hourly Electrical Energy Output (MW)",
    x = "Energy Output (MW)",
    y = "Frequency",
    color = ""
  ) +
  
  annotate("text",
           x = min(data$x2) - 0.05 * (max(data$x2) - min(data$x2)),
           y = 0.9 * max_count,
           label = sprintf("Mean: %.2f\nMedian: %.2f\nMode: %.2f\nStd Dev: %.2f\nSkewness: %.2f\nKurtosis: %.2f",
                           mean_val, median_val, mode_val, sd_val, skew_val, kurt_val),
           hjust = 0,
           vjust = 0,
           size = 7,
           color = "black") +
  
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 16),
    panel.grid.minor = element_blank(),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    legend.position = c(0.95, 0.75),
    legend.justification = c("right", "top"),
    legend.background = element_rect(fill = "white", color = NA),
    legend.key.size = unit(1.0, "lines"),
    legend.text = element_text(size = 12)
  )

# Display
options(repr.plot.width = 20, repr.plot.height = 8)
print(p_dep)

# Re-enable warnings
options(warn = 0)
No description has been provided for this image

It shows the distribution of the dependent variable, Net Hourly Electrical Energy Output MW. The distribution is slightly right-skewed, with most values concentrated between 430 MW and 470 MW. The mean energy output is 454.37 MW, while the median is slightly lower at 451.55 MW, suggesting a slight skewness. The standard deviation of 17.07 MW indicates moderate variability in the energy output. The black dashed line represents a normal distribution for comparison, showing that the actual data deviates slightly from normality, with a flatter peak and longer tail on the right.

Task 1.3: Normality of Data:¶

Q-Q plot, or Quantile-Quantile plot, is a graphical tool used to assess whether a dataset follows a particular theoretical distribution; most commonly, the normal distribution. It works by plotting the quantiles of our data against the corresponding quantiles of a normal distribution. If the data is normally distributed, the points in the plot will roughly align along a straight diagonal line.

To interpret a Q-Q plot, simply examine how closely the plotted points follow the reference line. If the points lie mostly along the line with minimal deviation, it suggests that the data is approximately normal. Systematic deviations from the line can reveal specific types of departures from normality—for instance, a curve that bends away at the ends could indicate heavy tails (i.e., data with more extreme values than expected in a normal distribution), while a pronounced S-shape may point to skewness.

In [26]:
# # Create Q-Q plots for all variables to check their normality:

# Suppress warnings temporarily
options(warn = -1)

# Define variables, labels, and colors
variables <- c('x1', 'x3', 'x4', 'x5', 'x2')
labels <- c(
  ' (x1) Ambient Temperature (°C)', 
  ' (x3) Ambient Pressure (millibar)', 
  ' (x4) Exhaust Vacuum (cm Hg)', 
  '(x5) Relative Humidity (%)', 
  ' (x2) Net Hourly Electrical Energy Output (MW)'
)
colors <- c('#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd')

# Set plot layout: 2 rows and 3 columns
layout(matrix(1:6, nrow = 2, byrow = TRUE))

# Global plot parameters
par(
  mar = c(5, 5, 4.5, 2),  # Margins: bottom, left, top, right
  oma = c(0, 0, 5, 0),    # Outer margin for title
  bg = "white",
  col.axis = "black",
  col.lab = "black",
  col.main = "black",
  cex.axis = 1.3,         # Axis tick label size
  cex.lab = 1.5,          # Axis label size
  cex.main = 1.9          # Title size
)

# Generate Q-Q plots for each variable
for (i in seq_along(variables)) {
  var <- variables[i]
  label <- labels[i]
  color <- colors[i]
  
  # Create Q-Q plot
  qqnorm(data[[var]],
         main = label,
         col = adjustcolor(color, alpha.f = 0.75),
         pch = 19,          # Point character
         cex = 0.9)        # Point size
  
  # Add Q-Q line
  qqline(data[[var]], col = "red", lwd = 2)  # Q-Q line in red
  box(col = "black")                          # Box outline
}

# Placeholder for unused plot area
plot.new()

# Add a central title for all plots
mtext("Q-Q Plots for Normality Check (All Variables):", outer = TRUE, line = 2, cex = 2.2, font = 2)

# Reset graphical parameters to defaults
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1)

# Restore warnings
options(warn = 0)
No description has been provided for this image
  • Ambient Temperature (x1): S-shaped curve; clear deviation from the straight line; indicates moderate skewness and heavy tails (not perfectly normal).

  • Ambient Pressure (x3): Step-like pattern and strong deviation at both ends; distribution is not normal; likely discrete or has many repeated values.

  • Exhaust Vacuum (x4): Points closely follow the red line; values are nearly normal; distribution is symmetric and well-behaved.

  • Relative Humidity (x5): S-shaped curve, especially at the tails; shows skewness and deviation from normality.

  • Net Hourly Electrical Energy Output (x2): Slight deviation at both tails, but mostly follows the line; distribution is close to normal, with mild skewness.

Only Exhaust Vacuum (x4) is approximately normal; Temperature (x1) and Humidity (x5) show clear non-normality (skewness/heavy tails); Pressure (x3) is strongly non-normal; Energy Output (x2) is close to normal but with minor tail deviations.

Task 1.4: Outliers Detecting:¶

Box plots, also known as box-and-whisker plots, are a powerful way to visually summarize the distribution of a dataset. It provide a clear picture of the spread, center, and variability of the data by showing five key summary statistics: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The box represents the middle 50% of the data—from Q1 to Q3—with a line inside the box indicating the median. Extending from either end of the box are the “whiskers,” which typically reach up to 1.5 times the interquartile range (IQR) from the quartiles.

One of the most useful features of a box plot is its ability to highlight potential outliers. Any data points that fall outside the whiskers—specifically, below Q1 − 1.5×IQR or above Q3 + 1.5×IQR—are plotted as individual points and considered outliers. This helps in quickly identifying values that deviate significantly from the overall trend or pattern in the data. By doing so, box plots make it easier to detect unusual observations that might require further investigation or special handling in a data analysis workflow.

In [27]:
# Create box plots for all variables: 

# Custom theme specific to this plot: 
boxplot_theme <- function() {
  theme_minimal(base_size = 14) +
    theme(
      panel.grid.major = element_line(color = "grey90"),
      panel.grid.minor = element_blank(),
      plot.title = element_text(size = 18, hjust = 0.5, face = "bold"),
      axis.title = element_text(size = 14),
      axis.text = element_text(size = 12),
      plot.background = element_rect(fill = "white", color = "grey80"),
      panel.background = element_rect(fill = "grey95", color = "grey70"),
      panel.spacing = unit(1.5, "lines"),  # Adjusted spacing between rows
      plot.margin = margin(10, 15, 10, 15)
    )
}

# Function to detect outliers using the IQR method
detect_outliers_iqr <- function(data, column, k = 1.5) {
  q1 <- quantile(data[[column]], 0.25, na.rm = TRUE)
  q3 <- quantile(data[[column]], 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower_bound <- q1 - k * iqr
  upper_bound <- q3 + k * iqr
  outliers <- data[data[[column]] < lower_bound | data[[column]] > upper_bound, ]
  return(list(outliers = outliers, lower_bound = lower_bound, upper_bound = upper_bound))
}

# Prepare variables and labels
all_vars <- c('x1', 'x3', 'x4', 'x5', 'x2')
all_labels <- c('Ambient Temperature (°C) (x1)', 'Ambient Pressure (millibar)(x3)', 'Exhaust Vacuum (cm Hg)(x4)', 
                'Relative Humidity (%)(x5)', 'Net Hourly Electrical Energy Output (MW)(x2)')
all_colors <- viridis::viridis(5, option = "D")

# Create and store individual plots
plot_list <- lapply(seq_along(all_vars), function(i) {
  var <- all_vars[i]
  label <- all_labels[i]
  color <- all_colors[i]
  
  # Detect outliers
  outlier_result <- detect_outliers_iqr(data, var)
  
  # Create horizontal boxplots: 
  p <- ggplot(data, aes(x = !!sym(var), y = 1)) +
    geom_boxplot(fill = color, color = "black", alpha = 0.8, 
                  width = 0.2,  # Adjusted width for better visibility
                  outlier.size = 2) +
    labs(title = label) +
    boxplot_theme() +
    theme(axis.text.y = element_blank(),
          axis.title.y = element_blank(),
          axis.ticks.y = element_blank()) +
    scale_y_continuous(limits = c(0.8, 1.3))  # Adjusted limits for better spacing
  return(p)
})

# Arrange plots in 2 columns with separation
final_plot <- wrap_plots(plot_list, ncol = 2) + 
  plot_annotation(
    title = "Boxplots for Outlier Detection",
    theme = theme(plot.title = element_text(size = 22, face = "bold", hjust = 0.5))
  )

# Display the plot
final_plot
No description has been provided for this image
In [28]:
# Box plots with violin plots for all variables (dependent variable + independent variables):  

# Custom theme for the plots
boxplot_theme <- function() {
  theme_minimal(base_size = 14) +
    theme(
      panel.grid.major = element_line(color = "grey90"),
      panel.grid.minor = element_blank(),
      plot.title = element_text(size = 18, hjust = 0.5, face = "bold"),
      axis.title = element_text(size = 14),
      axis.text = element_text(size = 12),
      plot.background = element_rect(fill = "white", color = "grey80"),
      panel.background = element_rect(fill = "grey95", color = "grey70"),
      panel.spacing = unit(1.5, "lines"),  # Adjust the spacing between plots
      plot.margin = margin(10, 15, 10, 15)
    )
}

# Function to detect outliers using the IQR method
detect_outliers_iqr <- function(data, column, k = 1.5) {
  q1 <- quantile(data[[column]], 0.25, na.rm = TRUE)
  q3 <- quantile(data[[column]], 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower_bound <- q1 - k * iqr
  upper_bound <- q3 + k * iqr
  outliers <- data[data[[column]] < lower_bound | data[[column]] > upper_bound, ]
  return(list(outliers = outliers, lower_bound = lower_bound, upper_bound = upper_bound))
}

# Prepare variables and labels
all_vars <- c('x1', 'x3', 'x4', 'x5', 'x2')
all_labels <- c('(x1) Ambient Temperature (°C)', ' (x3) Ambient Pressure (millibar)', '(x4) Exhaust Vacuum (cm Hg)', 
                '(x5) Relative Humidity (%)', '(x2) Net Hourly Electrical Energy Output (MW)')
all_colors <- viridis::viridis(5, option = "D")

# Create and store individual plots
plot_list <- lapply(seq_along(all_vars), function(i) {
  var <- all_vars[i]
  label <- all_labels[i]
  color <- all_colors[i]

  # Detect outliers for the current variable
  outlier_result <- detect_outliers_iqr(data, var)

  # Create combined boxplot and violin plot
  p <- ggplot(data, aes(x = !!sym(var), y = 1)) +
    geom_violin(aes(y = 1), fill = color, alpha = 0.4) +  # Adds violin plot
    geom_boxplot(aes(y = 1), color = "black", alpha = 0.8, 
                  width = 0.2, outlier.size = 2) +  # Adds boxplot
    labs(title = label) +
    boxplot_theme() +
    theme(axis.text.y = element_blank(),
          axis.title.y = element_blank(),
          axis.ticks.y = element_blank())

  return(p)
})

# Arrange plots in 2 columns with separation
final_plot <- wrap_plots(plot_list, ncol = 2) + 
  plot_annotation(
    title = "Box + Violin Plots for Outlier Detection:",
    theme = theme(plot.title = element_text(size = 22, face = "bold", hjust = 0.5))
  )

# Display the plot
final_plot
No description has been provided for this image

The distribution for (x1) Ambient Temperature is roughly symmetric and bell-shaped, with most values between 10°C and 30°C. There are no extreme outliers, and the data is moderately spread around the median.

The distribution for (x3) Ambient Pressure is slightly right-skewed, with a concentration of values around the center. There are a few mild outliers on both ends, but most data falls within a narrow range.

The distribution for (x4) Exhaust Vacuum is tight and symmetric, centered near the median. There are several outliers on both the low and high ends, but the majority of values are closely grouped.

The distribution for (x5) Relative Humidity is left-skewed, with most values between 60% and 100%. There are some low-end outliers, but the bulk of the data is concentrated at higher humidity levels.

The distribution for (x2) Net Hourly Electrical Energy Output is slightly right-skewed, with most values between 430 MW and 470 MW. There are no significant outliers, and the data is fairly symmetrically spread around the median.

In [29]:
# Create a dictionary of outliers for all variables
outliers_dict <- list()

# Detect and store outliers for each variable
for (var in all_vars) {
  # Get outliers using the IQR method
  outlier_result <- detect_outliers_iqr(data, var)
  
  # Store the outliers in the dictionary
  outliers_dict[[var]] <- outlier_result$outliers
}

# Print outliers for each variable
cat("Summary of Outliers by Variable:\n")
cat(strrep("-", 50), "\n")

# Loop through each variable to report on outliers
for (var in all_vars) {
  outliers <- outliers_dict[[var]]
  
  cat(sprintf("\nOutliers in %s:\n", var))
  cat(sprintf("Number of outliers: %d (%.2f%% of data)\n", nrow(outliers), (nrow(outliers) / nrow(data)) * 100))
  
  if (nrow(outliers) > 0) {
    cat("\nSample of outliers (first 10 if available):\n")
    print(head(outliers[, c(var, "time")], 10))
    
    # Calculate statistics for outliers vs non-outliers
    non_outliers <- data[!rownames(data) %in% rownames(outliers), ]
    
    cat(sprintf("\nOutlier statistics for %s:\n", var))
    cat(sprintf("Outlier range: %.2f to %.2f\n", min(outliers[[var]], na.rm = TRUE), max(outliers[[var]], na.rm = TRUE)))
    cat(sprintf("Non-outlier range: %.2f to %.2f\n", min(non_outliers[[var]], na.rm = TRUE), max(non_outliers[[var]], na.rm = TRUE)))
    cat(sprintf("Dataset mean: %.2f, Outlier mean: %.2f\n", mean(data[[var]], na.rm = TRUE), mean(outliers[[var]], na.rm = TRUE)))
    cat(sprintf("Dataset std: %.2f, Outlier std: %.2f\n", sd(data[[var]], na.rm = TRUE), sd(outliers[[var]], na.rm = TRUE)))
  }
  
  cat(strrep("-", 50), "\n")
}

# Create a data frame with all outliers combined
all_outliers <- bind_rows(outliers_dict)

# Get unique outliers (rows that are outliers in at least one variable)
unique_outliers <- distinct(all_outliers)

cat(sprintf("\nTotal unique outlier rows: %d (%.2f%% of data)\n", nrow(unique_outliers), (nrow(unique_outliers) / nrow(data)) * 100))

# Save outliers to a CSV file for further analysis if needed
write.csv(unique_outliers, 'outliers.csv', row.names = FALSE)
Summary of Outliers by Variable:
-------------------------------------------------- 
-------------------------------------------------- 

Outliers in x1:
Number of outliers: 0 (0.00% of data)
-------------------------------------------------- 

Outliers in x3:
Number of outliers: 0 (0.00% of data)
-------------------------------------------------- 

Outliers in x4:
Number of outliers: 91 (0.96% of data)

Sample of outliers (first 10 if available):
# A tibble: 10 × 2
      x4  time
   <dbl> <int>
 1  994.    31
 2 1030.   354
 3 1033.   414
 4 1031.   522
 5  996.  1016
 6  995.  1118
 7 1033.  1237
 8 1029.  1368
 9 1032.  1540
10 1032.  1565

Outlier statistics for x4:
Outlier range: 992.89 to 1033.30
Non-outlier range: 992.89 to 1033.30
Dataset mean: 1013.24, Outlier mean: 1024.88
Dataset std: 5.94, Outlier std: 13.95
-------------------------------------------------- 

Outliers in x5:
Number of outliers: 13 (0.14% of data)

Sample of outliers (first 10 if available):
# A tibble: 10 × 2
      x5  time
   <dbl> <int>
 1  25.6  1836
 2  31.2  1992
 3  31.0  2483
 4  28.2  3547
 5  28.8  3906
 6  30.6  4442
 7  26.3  4685
 8  29.9  5059
 9  25.9  5780
10  29.4  6292

Outlier statistics for x5:
Outlier range: 25.56 to 31.15
Non-outlier range: 25.56 to 100.16
Dataset mean: 73.33, Outlier mean: 28.81
Dataset std: 14.61, Outlier std: 2.08
-------------------------------------------------- 

Outliers in x2:
Number of outliers: 0 (0.00% of data)
-------------------------------------------------- 

Total unique outlier rows: 104 (1.09% of data)

Now, so, I have successfully filtered out the outliers from the dataset and saved them in a separate csv file.

In [30]:
# Import outliers from outliers.csv:
outliers <- read.csv('outliers.csv')

# Display the first few rows of the outliers dataset
head(outliers)
A data.frame: 6 × 7
timex1x3x4x5x2segment
<int><dbl><dbl><dbl><dbl><dbl><int>
1 3113.4241.23 994.1795.79468.821
2 354 9.4241.401029.6087.43478.122
3 414 9.8041.821032.9867.55473.723
4 522 9.2441.401030.7287.36476.903
5101616.2141.23 995.8880.00463.096
6111814.1341.23 994.6093.02467.846
In [31]:
# Check the dimensions of the outliers dataset
dim(outliers)
  1. 104
  2. 7

So, dataset contains 104 outlier data samples out of 9568 rows.

Outliers that we have detected actually represents rare but important operational conditions. Also, power plants occasionally experience such unexpected events that can lead to extreme values in the dataset but they are all valid observations. Including them in our model makes a lot of sense and their inclusion makes the model more robust and reflective of real-world conditions.

Task 1. 5: Scatter Plots and Scatter Plot Matrix:¶

Scatter Plot Matrix (SPLOM) or pair plot is a grid of scatter plots that displays a pairwise relation between different variables in our dataset. Each cell in the matrix represents a scatter plot between two variables. The diagonal cells contains histograms or density plots and shows the distribution of each variable.

Scatter Plot Matrix is very useful in identifying the correlation between different variables, checking linearity-non-linearity between variables, in identifying potential outliers in the datset, helps to find out potential clusters in the dataset, helps in selecting the most prominent variables (Dimensionality Reduction) for analysis, and in finding multicollinearity between variables (strong linear relationship between perdictor variables indicates multicollinearity).

In [32]:
# Create a Scatter Plot Matrix (SPLOM) for all variables
# Suppress warnings
options(warn = -1)

# Create a custom theme for the scatter plot matrix
custom_theme <- theme_minimal() +
  theme(
    axis.text = element_text(size = 8),
    axis.title = element_text(size = 10, face = "bold"),
    strip.text = element_text(size = 10, face = "bold"),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray95"),
    panel.border = element_rect(fill = NA, color = "gray80")
  )

# Set theme
theme_set(custom_theme)

# Select variables for the scatter plot matrix
plot_vars <- c("x1", "x3", "x4", "x5", "x2")
plot_data <- data[, plot_vars]

# Create nicer variable labels
var_labels <- c(" (x1) Ambient\nTemperature (°C)", 
                " (x3) Ambient\nPressure (mb)",
                " (x4) Exhaust\nVacuum (cmHg)",
                " (x5) Relative\nHumidity (%)", 
                " (x2) Energy\nOutput (MW)")

names(plot_data) <- var_labels

# Create the scatter plot matrix with ggpairs
splom <- ggpairs(
  plot_data,
  upper = list(
    continuous = wrap("cor", size = 3, color = "black")
  ),
  lower = list(
    continuous = function(data, mapping, ...) {
      ggplot(data, mapping) +
        geom_point(alpha = 0.3, size = 0.8, color = "#4477AA") +
        geom_smooth(method = "lm", color = "#EE6677", se = FALSE, linewidth = 0.8, message= FALSE, ...)
    }
  ),
  diag = list(
    continuous = function(data, mapping, ...) {
      ggplot(data, mapping) +
        geom_density(fill = "#4477AA", alpha = 0.7, ...)
    }
  )
) +
  ggtitle("Scatter Plot Matrix of CCPP Variables:") +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.text = element_text(size = 7),
    strip.text = element_text(size = 9)
  )

# Print the scatter plot matrix
print(splom)

# Reset warnings
options(warn = 0)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
No description has been provided for this image
  • Diagonal plots: show the distribution of each variable. Most are roughly bell-shaped, but (pressure and energy output) show multiple peaks or slight skewness.

  • Temperature (x1) vs. Energy Output (x2): Very strong negative linear relationship (correlation = -0.948), as temperature increases, energy output drops sharply.

  • Relative Humidity (x5) vs. Energy Output (x2): Strong negative relationship (correlation = -0.870). Higher humidity is linked to lower energy output.

  • Exhaust Vacuum (x4) vs. Energy Output (x2): Moderate positive relationship (correlation = 0.519). Higher vacuum tends to increase energy output.

  • Ambient Pressure (x3) vs. Energy Output (x2): Weak positive relationship (correlation = 0.391). Slight trend of higher pressure with higher output.

  • Temperature (x1) vs. Pressure (x3): Strong positive correlation (0.844). Higher temperatures often come with higher pressure.

  • Temperature (x1) vs. Humidity (x5): Strong negative correlation (-0.544). Hotter conditions usually mean lower humidity.

Temperature and humidity are the most influential factors (strongest correlations with output). Most relationships are linear, as shown by the straight trend lines. No major outliers or clusters, but some variables show non-normal distributions.

Task 1.6: Correlation Analysis:¶

Correlation is a way to measure how two variables move in relation to each other, whether variables tend to change together in a predictable way or not. The correlation coefficient quantifies this relationship on a scale from -1 to +1. A value of -1 indicates a perfect negative relationship (one goes up, the other goes down), +1 indicates a perfect positive relationship (both move in the same direction), and 0 suggests no relationship at all.

For correlations between independent and dependent variables (like temperature and energy output), stronger correlations are generally beneficial. They indicate that an input variable contains useful information for predicting the output. In CCPP data, the strong negative correlation between temperature and energy output suggests temperature will be a powerful predictor.

However, strong correlations between independent variables themselves (between temperature and humidity) can create a challenge called multicollinearity. When predictor variables are highly correlated with each other, it becomes difficult for the model to determine the individual contribution of each variable. This doesn't necessarily reduce the model's overall predictive power, but it can make the model less interpretable and potentially less stable when applied to new data.

In [33]:
# Pearson Correlation coefficient: 

# Select only the numeric variables (excluding time) and any other columns:
numeric_vars <- c("x1", "x3", "x4", "x5", "x2")
numeric_data <- data[, numeric_vars]

# Calculate Pearson correlation matrix
pearson_corr <- cor(numeric_data, method = "pearson")

# Round to 3 decimal places for better readability
pearson_corr_rounded <- round(pearson_corr, 3)

# Print the correlation matrix
cat("Pearson Correlation Matrix:\n")
print(pearson_corr_rounded)
Pearson Correlation Matrix:
       x1     x3     x4     x5     x2
x1  1.000  0.844 -0.508 -0.544 -0.948
x3  0.844  1.000 -0.416 -0.312 -0.870
x4 -0.508 -0.416  1.000  0.102  0.519
x5 -0.544 -0.312  0.102  1.000  0.391
x2 -0.948 -0.870  0.519  0.391  1.000
       x1     x3     x4     x5     x2
x1  1.000  0.844 -0.508 -0.544 -0.948
x3  0.844  1.000 -0.416 -0.312 -0.870
x4 -0.508 -0.416  1.000  0.102  0.519
x5 -0.544 -0.312  0.102  1.000  0.391
x2 -0.948 -0.870  0.519  0.391  1.000
In [34]:
# Create a data frame with variable pairs and their correlation values
# This makes it easier to sort and find the strongest correlations
corr_df <- data.frame(
  var1 = character(),
  var2 = character(),
  correlation = numeric(),
  stringsAsFactors = FALSE
)

# Fill the data frame with all variable pairs and their correlations
for (i in 1:(length(numeric_vars) - 1)) {
  for (j in (i + 1):length(numeric_vars)) {
    corr_df <- rbind(corr_df, data.frame(
      var1 = numeric_vars[i],
      var2 = numeric_vars[j],
      correlation = pearson_corr[i, j],
      stringsAsFactors = FALSE
    ))
  }
}

# Sort by absolute correlation value (descending)
corr_df <- corr_df %>%
  mutate(abs_corr = abs(correlation)) %>%
  arrange(desc(abs_corr))

# Print the sorted correlations
cat("\nSorted Correlations (by absolute value):\n")
print(corr_df[, c("var1", "var2", "correlation")], row.names = FALSE)
Sorted Correlations (by absolute value):
 var1 var2 correlation
   x1   x2  -0.9479079
   x3   x2  -0.8699002
   x1   x3   0.8436886
   x1   x5  -0.5439469
   x4   x2   0.5186868
   x1   x4  -0.5082216
   x3   x4  -0.4157184
   x5   x2   0.3911750
   x3   x5  -0.3122140
   x4   x5   0.1016310
 var1 var2 correlation
   x1   x2  -0.9479079
   x3   x2  -0.8699002
   x1   x3   0.8436886
   x1   x5  -0.5439469
   x4   x2   0.5186868
   x1   x4  -0.5082216
   x3   x4  -0.4157184
   x5   x2   0.3911750
   x3   x5  -0.3122140
   x4   x5   0.1016310
In [35]:
# Find the strongest correlation with the output variable (x2)
x2_correlations <- data.frame(
  variable = numeric_vars[numeric_vars != "x2"],
  correlation_with_x2 = pearson_corr["x2", numeric_vars[numeric_vars != "x2"]],
  stringsAsFactors = FALSE
) %>%
  mutate(abs_corr = abs(correlation_with_x2)) %>%
  arrange(desc(abs_corr))

# Print correlations with the output variable
cat("\nCorrelations with Output Variable (x2):\n")
print(x2_correlations[, c("variable", "correlation_with_x2")], row.names = FALSE)
Correlations with Output Variable (x2):
 variable correlation_with_x2
       x1          -0.9479079
       x3          -0.8699002
       x4           0.5186868
       x5           0.3911750
 variable correlation_with_x2
       x1          -0.9479079
       x3          -0.8699002
       x4           0.5186868
       x5           0.3911750
In [36]:
# Create a correlation heatmap showing only the lower triangle
# Load required libraries
library(ggplot2)
library(reshape2)
library(viridis)

# Create a mask for the upper triangle
mask <- upper.tri(pearson_corr_rounded, diag = TRUE)  # Include diagonal in mask

# Melt the correlation matrix for ggplot: 
melted_corr <- melt(pearson_corr_rounded)
names(melted_corr) <- c("Var1", "Var2", "Correlation")

# Apply the mask to keep only lower triangle
melted_corr <- melted_corr[!mask[cbind(match(melted_corr$Var1, numeric_vars), 
                                      match(melted_corr$Var2, numeric_vars))], ]

# Create nicer variable labels
var_labels <- c(" (x1)", 
                "(x3)", 
                "(x4)", 
                "(x5)",
                "(x2)")

# Replace variable names with nicer labels
melted_corr$Var1 <- factor(melted_corr$Var1, 
                          levels = numeric_vars,
                          labels = var_labels)
melted_corr$Var2 <- factor(melted_corr$Var2, 
                          levels = numeric_vars,
                          labels = var_labels)

# Create the heatmap: 
gg_heatmap <- ggplot(melted_corr, aes(x = Var2, y = Var1, fill = Correlation)) +
  geom_tile(color = "white", size = 0.5) +
  # Add correlation values: 
  geom_text(aes(label = sprintf("%.2f", Correlation)), 
            color = "black", size = 5) +  # Increased text size
  # Use a diverging color palette centered at 0
  scale_fill_gradient2(low = "#4575b4", mid = "white", high = "#d73027",
                      midpoint = 0, limit = c(-1, 1), name = "Pearson\nCorrelation") +
  # Add labels and title
  labs(title = "Pearson Correlation Matrix:",
    ) +
  # Adjust theme for better plot views: 
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 14, face = "bold"),  # Larger font
    axis.text.y = element_text(size = 14, face = "bold"),  # Larger font
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid = element_blank(),
    legend.position = "right",
    legend.text = element_text(size = 12),  # Larger legend text
    legend.title = element_text(size = 14),  # Larger legend title
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),  # Larger title
    plot.subtitle = element_text(size = 14, hjust = 0.5),  # Larger subtitle
    plot.margin = margin(20, 20, 20, 20)
  ) +
  # Turn off the grid
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  # Ensure the plot is square
  coord_fixed()
# Display the heatmap
print(gg_heatmap)
Warning message:
"Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead."
No description has been provided for this image
  • Strong negative correlation between ambient temperature (x1) and energy output (x2); so as the temperature increases, energy output drops sharply.

  • Strong negative correlation between relative humidity (x5) and energy output (x2); higher humidity is linked to lower energy output.

  • Strong positive correlation between ambient temperature (x1) and ambient pressure (x3); higher temperatures often come with higher pressure.

  • Moderate positive correlation between exhaust vacuum (x4) and Energy Output (x2); higher vacuum tends to increase energy output.

Temperature and humidity are the most influential factors (strongest correlations with output). Temperature and pressure are also strongly related.

Spearman's rank correlation can be used for describing non-linear relationships between variables. This measures monotonic relationships (where variables tend to change together, but not necessarily at a constant rate) rather than just linear ones, and it's also less sensitive to outliers.

In [37]:
# Calculate Spearman Rank Correlation Matrix
# Load required libraries
library(dplyr)

# Select only the numeric variables (excluding time)
numeric_vars <- c("x1", "x3", "x4", "x5", "x2")
numeric_data <- data[, numeric_vars]

# Calculate Spearman correlation matrix
spearman_corr <- cor(numeric_data, method = "spearman")

# Round to 3 decimal places for better readability
spearman_corr_rounded <- round(spearman_corr, 3)

# Print the correlation matrix
cat("Spearman Rank Correlation Matrix:\n")
print(spearman_corr_rounded)
Spearman Rank Correlation Matrix:
       x1     x3     x4     x5     x2
x1  1.000  0.850 -0.520 -0.544 -0.944
x3  0.850  1.000 -0.429 -0.305 -0.884
x4 -0.520 -0.429  1.000  0.089  0.543
x5 -0.544 -0.305  0.089  1.000  0.391
x2 -0.944 -0.884  0.543  0.391  1.000
       x1     x3     x4     x5     x2
x1  1.000  0.850 -0.520 -0.544 -0.944
x3  0.850  1.000 -0.429 -0.305 -0.884
x4 -0.520 -0.429  1.000  0.089  0.543
x5 -0.544 -0.305  0.089  1.000  0.391
x2 -0.944 -0.884  0.543  0.391  1.000
In [38]:
# Create a data frame with variable pairs and their correlation values
# This makes it easier to sort and find the strongest correlations
corr_df <- data.frame(
  var1 = character(),
  var2 = character(),
  correlation = numeric(),
  stringsAsFactors = FALSE
)

# Fill the data frame with all variable pairs and their correlations
for (i in 1:(length(numeric_vars) - 1)) {
  for (j in (i + 1):length(numeric_vars)) {
    corr_df <- rbind(corr_df, data.frame(
      var1 = numeric_vars[i],
      var2 = numeric_vars[j],
      correlation = spearman_corr[i, j],
      stringsAsFactors = FALSE
    ))
  }
}

# Sort by absolute correlation value (descending)
corr_df <- corr_df %>%
  mutate(abs_corr = abs(correlation)) %>%
  arrange(desc(abs_corr))

# Print the sorted correlations
cat("\nSorted Spearman Correlations (by absolute value):\n")
print(corr_df[, c("var1", "var2", "correlation")], row.names = FALSE)
Sorted Spearman Correlations (by absolute value):
 var1 var2 correlation
   x1   x2 -0.94434106
   x3   x2 -0.88400441
   x1   x3  0.85035481
   x1   x5 -0.54405910
   x4   x2  0.54301563
   x1   x4 -0.51971579
   x3   x4 -0.42888300
   x5   x2  0.39109517
   x3   x5 -0.30485345
   x4   x5  0.08911092
 var1 var2 correlation
   x1   x2 -0.94434106
   x3   x2 -0.88400441
   x1   x3  0.85035481
   x1   x5 -0.54405910
   x4   x2  0.54301563
   x1   x4 -0.51971579
   x3   x4 -0.42888300
   x5   x2  0.39109517
   x3   x5 -0.30485345
   x4   x5  0.08911092
In [39]:
# Find the strongest correlation with the output variable (x2)
x2_correlations <- data.frame(
  variable = numeric_vars[numeric_vars != "x2"],
  correlation_with_x2 = spearman_corr["x2", numeric_vars[numeric_vars != "x2"]],
  stringsAsFactors = FALSE
) %>%
  mutate(abs_corr = abs(correlation_with_x2)) %>%
  arrange(desc(abs_corr))

# Print correlations with the output variable
cat("\nSpearman Correlations with Output Variable (x2):\n")
print(x2_correlations[, c("variable", "correlation_with_x2")], row.names = FALSE)
Spearman Correlations with Output Variable (x2):
 variable correlation_with_x2
       x1          -0.9443411
       x3          -0.8840044
       x4           0.5430156
       x5           0.3910952
 variable correlation_with_x2
       x1          -0.9443411
       x3          -0.8840044
       x4           0.5430156
       x5           0.3910952
In [40]:
# Create a Spearman correlation heatmap:

# Create a mask for the upper triangle
mask <- upper.tri(spearman_corr_rounded, diag = TRUE)  # Include diagonal in mask

# Melt the correlation matrix for ggplot, applying the mask
melted_corr <- melt(spearman_corr_rounded)
names(melted_corr) <- c("Var1", "Var2", "Correlation")

# Apply the mask to keep only lower triangle
melted_corr <- melted_corr[!mask[cbind(match(melted_corr$Var1, numeric_vars), 
                                      match(melted_corr$Var2, numeric_vars))], ]

# Create nicer variable labels
var_labels <- c("(x1)", 
                "(x3)",
                "(x4)", 
                "(x5)", 
                 "(x2)")

# Replace variable names with nicer labels
melted_corr$Var1 <- factor(melted_corr$Var1, 
                          levels = numeric_vars,
                          labels = var_labels)
melted_corr$Var2 <- factor(melted_corr$Var2, 
                          levels = numeric_vars,
                          labels = var_labels)

# Create the heatmap with larger font sizes
gg_heatmap <- ggplot(melted_corr, aes(x = Var2, y = Var1, fill = Correlation)) +
  geom_tile(color = "white", size = 0.5) +
  # Add correlation values as text with larger font
  geom_text(aes(label = sprintf("%.2f", Correlation)), 
            color = "black", size = 5) +  # Increased text size
  # Use a diverging color palette centered at 0
  scale_fill_gradient2(low = "#4575b4", mid = "white", high = "#d73027",
                      midpoint = 0, limit = c(-1, 1), name = "Spearman\nCorrelation") +
  # Add labels and title
  labs(title = "Spearman Rank Correlation Heatmap:",
    ) +
  # Adjust theme with larger fonts
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 14, face = "bold"),  # Larger font
    axis.text.y = element_text(size = 14, face = "bold"),  # Larger font
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid = element_blank(),
    legend.position = "right",
    legend.text = element_text(size = 12),  # Larger legend text
    legend.title = element_text(size = 14),  # Larger legend title
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),  # Larger title
    plot.subtitle = element_text(size = 14, hjust = 0.5),  # Larger subtitle
    plot.margin = margin(20, 20, 20, 20)
  ) +
  # Turn off the grid
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  # Ensure the plot is square
  coord_fixed()

# Display the heatmap
print(gg_heatmap)
No description has been provided for this image

Very strong negative monotonic correlation between ambient temperature (x1) and energy output (x2); as temperature increases, energy output consistently decreases.

Very strong negative monotonic correlation between relative humidity (x5) and energy output (x2); higher humidity is strongly linked to lower energy output.

Very strong positive monotonic correlation between ambient temperature (x1) and ambient pressure (x3); H higher temperatures are closely associated with higher pressure.

Moderate positive monotonic correlation between exhaust Vacuum (x4) and energy output (x2); Higher vacuum generally increases energy output.

Temperature and humidity are the most influential factors for energy output (strongest monotonic correlations). Temperature and pressure also show a strong monotonic relationship.

Task 1.7: Hypothesis Testing:¶

Hypothesis testing is a structured way to make decisions based on data rather than our feelings. It is a scientific approach to answering yes/no questions about our data.

In hypothesis testing, we start with a question like "Are temperature and energy output related?" Then we form a hypothesis - starting with the assumption that there's no relationship (the null hypothesis). Then collect and analyze data to see if there's enough evidence to reject this initial assumption.

The beauty of hypothesis testing is that it gives us a measure of confidence in our conclusions. When a test returns a small p-value (typically less than 0.05), it's telling us that "the pattern we're seeing is very unlikely to have occurred by random chance." This gives a solid ground to stand on when making decisions.

For our work, hypothesis testing can help us to:

  • Confirm which variables truly influence energy output

  • Determine if the relationships we're seeing in scatter plots are statistically meaningful

  • Identify which patterns are reliable enough to include in our predictive models

  • Make objective decisions about which model structure might be most appropriate

Rather than just eyeballing our data and making subjective judgments, hypothesis testing provides a rigorous framework that quantifies uncertainty and helps us avoid being misled by random fluctuations or coincidental patterns in our dataset.

1.7.1: Hypothesis Testing for Relationship between Temperature (x1) and Energy Output (x2):¶
  • Null Hypothesis $H₀$: There is no significant relationship between temperature (x1) and energy output (x2).

  • Alternative Hypothesis $H₁$: There is a significant relationship between temperature (x1) and energy output (x2).

In [41]:
# Hypothesis Testing using Kendall's Tau: 

# Define significance level
alpha <- 0.05

# Create a function to perform Kendall's Tau test and format results
perform_kendall_test <- function(x, y, var_x_name, var_y_name) {
  # Perform Kendall's Tau test
  test_result <- cor.test(x, y, method = "kendall", exact = FALSE)
  
  # Extract results
  tau <- test_result$estimate
  p_value <- test_result$p.value
  
  # Determine significance
  is_significant <- p_value < alpha
  decision <- ifelse(is_significant, "Reject H₀", "Fail to reject H₀")
  
  # Return formatted results
  return(list(
    var_x = var_x_name,
    var_y = var_y_name,
    tau = tau,
    p_value = p_value,
    is_significant = is_significant,
    decision = decision
  ))
}

# Test relationship between Temperature (x1) and Energy Output (x2)
cat("Hypothesis Test: Temperature and Energy Output Relationship\n")
cat("------------------------------------------------------------\n")
cat("H₀: There is no significant monotonic relationship between temperature (x1) and energy output (x2)\n")
cat("H₁: There is a significant monotonic relationship between temperature (x1) and energy output (x2)\n")
cat("\nResults:\n")

x1_x2_test <- perform_kendall_test(data$x1, data$x2, "Temperature (x1)", "Energy Output (x2)")

cat(sprintf("Kendall's Tau: %.4f\n", x1_x2_test$tau))
cat(sprintf("p-value: %.8f\n", x1_x2_test$p_value))
cat(sprintf("Decision: %s (α = %.2f)\n", x1_x2_test$decision, alpha))

if (x1_x2_test$is_significant) {
  cat("\nInterpretation: There is a statistically significant monotonic relationship between\n")
  cat("temperature and energy output. The negative tau value indicates that as temperature\n")
  cat("increases, energy output tends to decrease, and this relationship is not due to chance.\n")
} else {
  cat("\nInterpretation: There is insufficient evidence to conclude that a monotonic relationship\n")
  cat("exists between temperature and energy output.\n")
}
Hypothesis Test: Temperature and Energy Output Relationship
------------------------------------------------------------
H₀: There is no significant monotonic relationship between temperature (x1) and energy output (x2)
------------------------------------------------------------
H₀: There is no significant monotonic relationship between temperature (x1) and energy output (x2)
H₁: There is a significant monotonic relationship between temperature (x1) and energy output (x2)

Results:
Kendall's Tau: -0.7956
p-value: 0.00000000
Decision: Reject H₀ (α = 0.05)

Interpretation: There is a statistically significant monotonic relationship between
temperature and energy output. The negative tau value indicates that as temperature
increases, energy output tends to decrease, and this relationship is not due to chance.

So, we reject the Null Hypothesis and conclude that there is a significant relationship between the temperature (x1) and energy output (x2).

1.7.2: Hypothesis Testing for Multicollinearity between Ambient Pressure (x3) and Exhaust Vaccum (x4):¶
  • Null Hypothesis $H₀$: There is no significant multicollinearity between ambient pressure (x3) and exhaust vacuum (x4).

  • Alternative Hypothesis $H₁$: There is significant multicollinearity between ambient pressure (x3) and exhaust vacuum (x4).

In [42]:
# Hypothesis Test: Multicollinearity Between Ambient Pressure (x3) and Exhaust Vacuum (x4)

# 1. Define Hypotheses:
cat("Hypothesis Test: Multicollinearity Between Ambient Pressure (x3) and Exhaust Vacuum (x4)\n")
cat("----------------------------------------------------------------------------------------\n")
cat("Null Hypothesis (H0): There is no significant multicollinearity between x3 (pressure) and x4 (vacuum).\n")
cat("Alternative Hypothesis (H1): There is significant multicollinearity between x3 (pressure) and x4 (vacuum).\n\n")

# 2. Calculate VIF for each variable
# VIF = 1 / (1 - R^2), where R^2 is from regressing one variable on the other

# Model 1: x3 ~ x4
model_x3 <- lm(x3 ~ x4, data = data)
r2_x3 <- summary(model_x3)$r.squared
vif_x3 <- 1 / (1 - r2_x3)

# Model 2: x4 ~ x3
model_x4 <- lm(x4 ~ x3, data = data)
r2_x4 <- summary(model_x4)$r.squared
vif_x4 <- 1 / (1 - r2_x4)

cat(sprintf("VIF for x3 (pressure): %.3f\n", vif_x3))
cat(sprintf("VIF for x4 (vacuum):   %.3f\n", vif_x4))

# 3. Decision threshold (commonly 5 or 10)
vif_threshold <- 5

if (vif_x3 > vif_threshold | vif_x4 > vif_threshold) {
  cat("\nDecision: Reject H0 (VIF >", vif_threshold, ")\n")
  cat("Interpretation: There is significant multicollinearity between x3 and x4.\n")
  cat("This means these variables share substantial variance, which can make regression coefficients unstable.\n")
} else {
  cat("\nDecision: Fail to reject H0 (VIF <=", vif_threshold, ")\n")
  cat("Interpretation: There is no significant multicollinearity between x3 and x4.\n")
  cat("Both variables can be included in regression models without major concern for instability.\n")
}

# 4. VIF Interpretation Context
cat("\nVIF Interpretation Guide:\n")
cat("  VIF = 1: No multicollinearity\n")
cat("  1 < VIF < 5: Low to moderate multicollinearity (acceptable)\n")
cat("  5 ≤ VIF < 10: High multicollinearity (caution)\n")
cat("  VIF ≥ 10: Severe multicollinearity (problematic)\n")
Hypothesis Test: Multicollinearity Between Ambient Pressure (x3) and Exhaust Vacuum (x4)
----------------------------------------------------------------------------------------
Null Hypothesis (H0): There is no significant multicollinearity between x3 (pressure) and x4 (vacuum).
Alternative Hypothesis (H1): There is significant multicollinearity between x3 (pressure) and x4 (vacuum).

VIF for x3 (pressure): 1.209
VIF for x4 (vacuum):   1.209

Decision: Fail to reject H0 (VIF <= 5 )
Interpretation: There is no significant multicollinearity between x3 and x4.
Both variables can be included in regression models without major concern for instability.

VIF Interpretation Guide:
  VIF = 1: No multicollinearity
  1 < VIF < 5: Low to moderate multicollinearity (acceptable)
  5 ≤ VIF < 10: High multicollinearity (caution)
  VIF ≥ 10: Severe multicollinearity (problematic)
1.7.3: Hypothesis Testing for Multicollinearity between Temperature (x1) and Pressure (x3):¶
  • Null Hypothesis (H0): There is no significant multicollinearity between x1 (temperature) and x3 (pressure).

  • Alternative Hypothesis (H1): There is significant multicollinearity between x1 (temperature) and x3 (pressure).

In [43]:
# Hypothesis Test: Multicollinearity Between Ambient Temperature (x1) and Ambient Pressure (x3)

# 1. State the Hypotheses:
cat("Hypothesis Test: Multicollinearity Between Ambient Temperature (x1) and Ambient Pressure (x3)\n")
cat("---------------------------------------------------------------------------------------------\n")
cat("Null Hypothesis (H0): There is no significant multicollinearity between x1 (temperature) and x3 (pressure).\n")
cat("Alternative Hypothesis (H1): There is significant multicollinearity between x1 (temperature) and x3 (pressure).\n\n")

# 2. Calculate VIF for each variable
# VIF = 1 / (1 - R^2), where R^2 is from regressing one variable on the other

# Model 1: x1 ~ x3
model_x1 <- lm(x1 ~ x3, data = data)
r2_x1 <- summary(model_x1)$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# Model 2: x3 ~ x1
model_x3 <- lm(x3 ~ x1, data = data)
r2_x3 <- summary(model_x3)$r.squared
vif_x3 <- 1 / (1 - r2_x3)

cat(sprintf("VIF for x1 (temperature): %.3f\n", vif_x1))
cat(sprintf("VIF for x3 (pressure):    %.3f\n", vif_x3))

# 3. Decision threshold (commonly 5)
vif_threshold <- 5

if (vif_x1 > vif_threshold | vif_x3 > vif_threshold) {
  cat("\nDecision: Reject H0 (VIF >", vif_threshold, ")\n")
  cat("Interpretation: There is significant multicollinearity between x1 and x3.\n")
  cat("This means these variables share substantial variance, which can make regression coefficients unstable.\n")
} else {
  cat("\nDecision: Fail to reject H0 (VIF <=", vif_threshold, ")\n")
  cat("Interpretation: There is no significant multicollinearity between x1 and x3.\n")
  cat("Both variables can be included in regression models without major concern for instability.\n")
}

# 4. VIF Interpretation Context
cat("\nVIF Interpretation Guide:\n")
cat("  VIF = 1: No multicollinearity\n")
cat("  1 < VIF < 5: Low to moderate multicollinearity (acceptable)\n")
cat("  5 ≤ VIF < 10: High multicollinearity (caution)\n")
cat("  VIF ≥ 10: Severe multicollinearity (problematic)\n")
Hypothesis Test: Multicollinearity Between Ambient Temperature (x1) and Ambient Pressure (x3)
---------------------------------------------------------------------------------------------
Null Hypothesis (H0): There is no significant multicollinearity between x1 (temperature) and x3 (pressure).
Alternative Hypothesis (H1): There is significant multicollinearity between x1 (temperature) and x3 (pressure).

---------------------------------------------------------------------------------------------
Null Hypothesis (H0): There is no significant multicollinearity between x1 (temperature) and x3 (pressure).
Alternative Hypothesis (H1): There is significant multicollinearity between x1 (temperature) and x3 (pressure).

VIF for x1 (temperature): 3.470
VIF for x3 (pressure):    3.470

Decision: Fail to reject H0 (VIF <= 5 )
Interpretation: There is no significant multicollinearity between x1 and x3.
Both variables can be included in regression models without major concern for instability.

VIF Interpretation Guide:
  VIF = 1: No multicollinearity
  1 < VIF < 5: Low to moderate multicollinearity (acceptable)
  5 ≤ VIF < 10: High multicollinearity (caution)
  VIF ≥ 10: Severe multicollinearity (problematic)

Task-2: Regression- Modelling the Relationship Between Environmental Variables and Power Output:¶

First, prepare our data for modeling:

In [44]:
# Create a clean dataset for modeling with original variable names
# This makes easy to match the model formulas later: 
data <- data.frame(
  x1 = data$x1,   # Temperature
  x3 = data$x3,   # Ambient Pressure
  x4 = data$x4,   # Exhaust Vacuum
  x5 = data$x5,   # Relative Humidity
  y = data$x2    # Output variable (Energy Output)
)
In [45]:
head(data)
A data.frame: 6 × 5
x1x3x4x5y
<dbl><dbl><dbl><dbl><dbl>
1 8.3440.771010.8490.01480.48
223.6458.491011.4074.20445.75
329.7456.901007.1541.91438.76
419.0749.691007.2276.79453.09
511.8040.661017.1397.20464.43
613.9739.161016.0584.60470.96

To address the problem of possible Multicolinearity that may exist between the dependent variables and to improving the numerical stability of our calculations and to obtain better optimized models, lets apply different transformations to the dataset.

From the experiments conducted in different notebooks (available in repository with names like: Testing_with_..transformation.ipynb), we have figured out that Log-Transformation produces the better performing results comparatively and so we decided to use this transformation here.

Log-Transformation:

In this transformation we replace each value x in our data with log(x) (usually natural log ln(x) or base-10 log10(x)). It compresses large values and expands small values, helping reduce data skewness. It turns multiplicative relationships into additive ones, which is often easier for models to handle. Log transformation is only applied to positive numbers (because log of 0 or negative values is undefined).

In [46]:
# Log Transformation (Natural Log): 

# Formula: x_transformed = log(x)

# Note: we add a small constant if data contains zeros or negatives
epsilon <- 1e-6  # Small constant to avoid log(0)

data <- data.frame(
  x1 = log(data$x1 + epsilon),
  x3 = log(data$x3 + epsilon),
  x4 = log(data$x4 + epsilon),
  x5 = log(data$x5 + epsilon),
  #y  = log(data$y  + epsilon) # Transformation of target variable too
  # Tried transforming y too but it makes interpretation difficult and seems unnecessary
  y= data$y                    # No transformation of target variable
)
In [47]:
# Display the first few rows of the transformed dataset
head(data)
A data.frame: 6 × 5
x1x3x4x5y
<dbl><dbl><dbl><dbl><dbl>
12.1210633.7079476.9185374.499921480.48
23.1629404.0688566.9190914.306764445.75
33.3924934.0412956.9148803.735524438.76
42.9481163.9058046.9149494.341074453.09
52.4681003.7052456.9247404.576771464.43
62.6369123.6676566.9236784.437934470.96
In [48]:
# Check summary statistics of transformed data

summary(data[, c("x1", "x3", "x4", "x5")])
       x1               x3              x4              x5       
 Min.   :0.5933   Min.   :3.233   Min.   :6.901   Min.   :3.241  
 1st Qu.:2.6049   1st Qu.:3.731   1st Qu.:6.917   1st Qu.:4.149  
 Median :3.0131   Median :3.953   Median :6.921   Median :4.317  
 Mean   :2.8887   Mean   :3.967   Mean   :6.921   Mean   :4.273  
 3rd Qu.:3.2469   3rd Qu.:4.197   3rd Qu.:6.925   3rd Qu.:4.441  
 Max.   :3.6139   Max.   :4.401   Max.   :6.941   Max.   :4.607  
In [49]:
# Check standard deviation is approx to 
sapply(data[, c("x1", "x3", "x4", "x5")], sd)
x1
0.454309591484867
x3
0.235640936800593
x4
0.00585847125925815
x5
0.21875617803414

Following are the models we have to estimate the parameters for:

Model 1:

$y = \theta_1 x_4 + \theta_2 x_3^2 + \theta_{bias}$

This model relates energy output $y$ to exhaust vaccum (x4) and the square of ambient pressure (x3²), plus a bias term.

Model 2:

$y = \theta_1 x_4 + \theta_2 x_3^2 + \theta_3 x_5 + \theta_{bias}$

This model extends Model 1 by adding relative humidity (x5) as an additional predictor.

Model 3:

$y = \theta_1 x_3 + \theta_2 x_4 + \theta_3 x_5^3$

This model relates energy output to ambient pressure (x3), exhaust vaccum (x4), and the cube of relative humidity (x5³). Note that this model has no bias term.

Model 4:

$y = \theta_1 x_4 + \theta_2 x_3^2 + \theta_3 x_5^3 + \theta_{bias}$

This model combines exhaust vaccum (x4), squared ambient pressure (x3²), and cubed relative humidity (x5³), plus a bias term.

Model 5:

$y = \theta_1 x_4 + \theta_2 x_1^2 + \theta_3 x_3^2 + \theta_{bias}$

This model relates energy output to exhaust vaccum (x4), squared temperature (x1²), and squared ambient pressure (x3²), plus a bias term.

Design Matrix:¶

A design matrix is a structured way to organize our data for regression analysis.

A design matrix (often denoted as X) is a table where:

  • Each row represents one observation or data point
  • Each column represents a feature or transformation of our data
  • The matrix includes all the predictor variables needed for our model

The design matrix organizes all the information in a way that mathematical algorithms can process.

In R, these models can be implemented as follows:

In [50]:
# Create Design Matrix for Model 1: y = θ₁x₄ + θ₂x₃² + θ_bias
# This model relates energy output to exhaust vaccum (x4) 
# and the square of ambient pressure (x3²)
# 
#' @param data A dataframe containing the variables x3 and x4
#' @return A matrix with columns for bias term, x4, and squared x3
create_model_1_matrix <- function(data) {
  # Add column of 1's for the bias/intercept term
  # Add column for relative humidity (x4) as is
  # Add column for squared ambient pressure (x3^2)
  X <- cbind(1, data$x4, data$x3^2)
  
  # Assign meaningful column names to the design matrix
  colnames(X) <- c("bias", "x4", "x3_squared")
  
  # Return the complete design matrix
  return(X)
}
# Generate the design matrix using our dataset
model_1_matrix <- create_model_1_matrix(data)

# Display the first few rows of the design matrix
head(model_1_matrix)
A matrix: 6 × 3 of type dbl
biasx4x3_squared
16.91853713.74887
16.91909116.55559
16.91488016.33207
16.91494915.25530
16.92474013.72884
16.92367813.45170
In [51]:
# Model 2: y = θ₁x₄ + θ₂x₃² + θ₃x₅ + θ_bias
# This model extends Model 1 by adding relative humidity (x5)

create_model_2_matrix <- function(data) {
  # Create design matrix for Model 2
  X <- cbind(1, data$x4, data$x3^2, data$x5)
  # Assign meaningful column names to the design matrix:
  colnames(X) <- c("bias", "x4", "x3_squared", "x5")
  return(X)
}

# Generate the design matrix using our dataset
model_2_matrix <-create_model_2_matrix(data)

# Display the first few rows of the design matrix
head(model_2_matrix)
A matrix: 6 × 4 of type dbl
biasx4x3_squaredx5
16.91853713.748874.499921
16.91909116.555594.306764
16.91488016.332073.735524
16.91494915.255304.341074
16.92474013.728844.576771
16.92367813.451704.437934
In [52]:
# Model 3: y = θ₁x₃ + θ₂x₄ + θ₃x₅³
# This model relates energy output to ambient pressure (x3), 
# exhaust vaccum (x4), 
# and the cube of relative humidity (x5³). 
#Note: No bias term in this model.

create_model_3_matrix <- function(data) {
  # Create design matrix for Model 3
  X <- cbind(data$x3, data$x4, data$x5^3)
  # Assign meaningful column names to the design matrix
  colnames(X) <- c("x3", "x4", "x5_cubed")
  return(X)
}
# Generate the design matrix using our dataset
model_3_matrix <- create_model_3_matrix(data)

# Display the first few rows of the design matrix
head(model_3_matrix)
A matrix: 6 × 3 of type dbl
x3x4x5_cubed
3.7079476.91853791.12019
4.0688566.91909179.88280
4.0412956.91488052.12604
3.9058046.91494981.80723
3.7052456.92474095.86884
3.6676566.92367887.40627
In [53]:
# Model 4: y = θ₁x₄ + θ₂x₃² + θ₃x₅³ + θ_bias
# This model combines exhaust vaccum (x4), 
#squared ambient pressure (x3²),
# and cubed relative humidity(x5³)

create_model_4_matrix <- function(data) {
  # Create design matrix for Model 4
  X <- cbind(1, data$x4, data$x3^2, data$x5^3)
  # Assigning meaningful column names to the design matrix
  colnames(X) <- c("bias", "x4", "x3_squared", "x5_cubed")
  return(X)
}

# Generate the design matrix using our dataset
model_4_matrix <- create_model_4_matrix(data)

# Display the first few rows of the design matrix
head(model_4_matrix)
A matrix: 6 × 4 of type dbl
biasx4x3_squaredx5_cubed
16.91853713.7488791.12019
16.91909116.5555979.88280
16.91488016.3320752.12604
16.91494915.2553081.80723
16.92474013.7288495.86884
16.92367813.4517087.40627
In [54]:
# Model 5: y = θ₁x₄ + θ₂x₁² + θ₃x₃² + θ_bias
# This model relates energy output to exhaust vaccum (x4),
#squared temperature (x1²),
# and squared ambient pressure (x3²)

create_model_5_matrix <- function(data) {
  # Create design matrix for Model 5
  X <- cbind(1, data$x4, data$x1^2, data$x3^2)
  # Assigning meaningful column names to the design matrix
  colnames(X) <- c("bias", "x4", "x1_squared", "x3_squared")
  return(X)
}

# Generate the design matrix using our dataset
model_5_matrix <- create_model_5_matrix(data)

# Display the first few rows of the design matrix
head(model_5_matrix)
A matrix: 6 × 4 of type dbl
biasx4x1_squaredx3_squared
16.918537 4.49891013.74887
16.91909110.00419116.55559
16.91488011.50900916.33207
16.914949 8.69139115.25530
16.924740 6.09151613.72884
16.923678 6.95330613.45170

The X matrices created by each create_model_X_matrix() function are design matrices. These matrices contain the predictor variables (and their transformations) that will be used to estimate the model parameters.

For example, in Model 1: $y = \theta_1 x_4 + \theta_2 x_3^2 + \theta_{bias}$

The design matrix X would have three columns:

  1. A column of 1's (for the bias term)

  2. The values of $x_4$ (Exhaust Vaccum)

  3. The squared values of $x_3$ ie. $x_3^2$ (Ambient Pressure squared)

The design matrix X organizes the predictor variables according to the model specification.

Task 2.1: Estimate Model Parameters:¶

Here I define a function estimate_parameters() that implements the Least Squares formula.

The mathematical expression for the Least Squares estimation is:

$θ̂ = (X^T X)^(-1)X^T y$

In [55]:
# Function to estimate parameters using Least Squares method:
#
# This function implements the Ordinary Least Squares (OLS) estimation
# to find the optimal parameter values that minimize the sum of squared residuals.
#
#' @param X The design matrix with predictor variables (including bias term if needed)
#' @param y The vector of target/response values
#' @return A vector of estimated parameters (theta_hat)
estimate_parameters <- function(X, y) {
  # Calculate the parameter estimates using the closed-form solution:
  # θ̂ = (X^T X)^(-1) X^T y
  
  # Step 1: Calculate X^T (transpose of X)
  # Step 2: Calculate X^T X (matrix multiplication)
  # Step 3: Calculate (X^T X)^(-1) (matrix inversion)
  # Step 4: Calculate (X^T X)^(-1) X^T (matrix multiplication)
  # Step 5: Calculate (X^T X)^(-1) X^T y (final multiplication with y)
  theta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
  

  # Return the estimated parameters
  return(theta_hat)
}
In [56]:
# Extract the target variable:
y <- data$y
head(y)
  1. 480.48
  2. 445.75
  3. 438.76
  4. 453.09
  5. 464.43
  6. 470.96
In [57]:
# Estimate parameters for Model 1:
#
# Model 1: y = θ₁x₄ + θ₂x₃² + θ_bias
# This model relates energy output to exhaust vaccum (x4) 
# and the square of ambient pressure (x3²)
#
# Create the design matrix for Model 1:
# This generates a matrix with columns for (bias term, x4, and x3²)
X1 <- create_model_1_matrix(data)

# Apply the OLS estimation to find optimal parameters
# This calculates θ̂ = (X^T X)^(-1) X^T y to minimize sum of squared errors
theta1 <- estimate_parameters(X1, y)

# Display the estimated parameters
# The output shows the coefficient values for bias, x4, and x3²
cat("Model 1 parameters:\n")
print(theta1)
Model 1 parameters:
                   [,1]
bias       -3262.576790
x4           553.643467
x3_squared    -7.269425
                   [,1]
bias       -3262.576790
x4           553.643467
x3_squared    -7.269425

This output shows the estimated parameters (coefficients) for Model 1:

  1. bias $(-3262.576790)$:
    This is the intercept term and it represents the baseline energy output when all other variables are zero. Theoretically this is the energy output when all other parameters set to constant or zero. But, practically, this will never happen because Exhaust vacuum (cmHg): Always > 0 (total vacuum is impossible); Pressure: Always > 0 (vacuum would imply system failure).

  2. x4 $(553.643467)$:
    This is the coefficient for exhaust vaccum . It indicates that for each one-unit increase in exhaust vaccum, the energy output increases by approximately 553.64 units, holding other variables constant. This positive relationship suggests higher vaccum correlates with higher energy output.

  3. x3_squared $(-7.269425)$:
    This is the coefficient for the squared ambient pressure term. The negative value indicates that as the square of ambient pressure increases, energy output tends to decrease. This suggests a non-linear (quadratic) relationship between ambient pressure and energy output, with potentially an optimal pressure point beyond which efficiency declines.

These coefficients define the regression equation as:

$Energy Output = -3262.576790 + 553.643467×(vaccum) -7.269425×(pressure²)$

In [58]:
# Estimate parameters for Model 2:
#
# Model 2: y = θ₁x₄ + θ₂x₃² + θ₃x₅ + θ_bias
# This model extends Model 1 by adding relative humidity (x5) as an additional predictor
#
# Create the design matrix for Model 2: 
# This generates a matrix with columns for (bias term, x4, x3², and x5)
X2 <- create_model_2_matrix(data)

# Apply the OLS estimation to find optimal parameters: 
# This calculates θ̂ = (X^T X)^(-1) X^T y to minimize sum of squared errors
theta2 <- estimate_parameters(X2, y)

# Display the estimated parameters:
# The output shows the coefficient values for bias, x4, x3², and x5
cat("\nModel 2 parameters:\n")
print(theta2)
Model 2 parameters:
                   [,1]
bias       -3372.839311
x4           562.445136
x3_squared    -6.891603
x5            10.153047
                   [,1]
bias       -3372.839311
x4           562.445136
x3_squared    -6.891603
x5            10.153047
In [59]:
# Estimate parameters for Model 3:
#
# Model 3 formula: y = θ₁x₃ + θ₂x₄ + θ₃x₅³
# This model relates energy output to ambient pressure (x3), exhaust vaccum (x4),
# and the cube of relative humidity (x5³). Note: This model has no bias/intercept term.
#
# Create the design matrix for Model 3
# This generates a matrix with columns for (x3, x4, and x5³) (no bias term)
X3 <- create_model_3_matrix(data)

# Apply the OLS estimation to find optimal parameters
# This calculates θ̂ = (X^T X)^(-1) X^T y to minimize sum of squared errors
theta3 <- estimate_parameters(X3, y)

# Display the estimated parameters
# The output shows the coefficient values for x3, x4, and x5³
cat("\nModel 3 parameters:\n")
print(theta3)
Model 3 parameters:
                [,1]
x3       -59.7397694
x4        97.7646009
x5_cubed   0.1869576
                [,1]
x3       -59.7397694
x4        97.7646009
x5_cubed   0.1869576
In [60]:
# Estimate parameters for Model 4:
#
# Model 4 formula: y = θ₁x₄ + θ₂x₃² + θ₃x₅³ + θ_bias
# This model combines exhaust vaccum (x4), squared ambient pressure (x3²),
# and cubed relative humidity (x5³) with a bias term
#
# Create the design matrix for Model 4
# This generates a matrix with columns for (bias term, x4, x3², and x5³)
X4 <- create_model_4_matrix(data)

# Apply the OLS estimation to find optimal parameters
# This calculates θ̂ = (X^T X)^(-1) X^T y to minimize sum of squared errors
theta4 <- estimate_parameters(X4, y)

# Display the estimated parameters
# The output shows the coefficient values for bias, x4, x3², and x5³
cat("\nModel 4 parameters:\n")
print(theta4)
Model 4 parameters:
                   [,1]
bias       -3359.443301
x4           564.511467
x3_squared    -6.878110
x5_cubed       0.196818
                   [,1]
bias       -3359.443301
x4           564.511467
x3_squared    -6.878110
x5_cubed       0.196818
In [61]:
# Estimate parameters for Model 5:
#
# Model 5 formula: y = θ₁x₄ + θ₂x₁² + θ₃x₃² + θ_bias
# This model relates energy output to exhaust vaccum (x4), squared temperature (x1²),
# and squared ambient pressure (x3²) with a bias term
#
# Create the design matrix for Model 5
# This generates a matrix with columns for bias term, x4, x1², and x3²
X5 <- create_model_5_matrix(data)

# Apply the OLS estimation to find optimal parameters
# This calculates θ̂ = (X^T X)^(-1) X^T y to minimize sum of squared errors
theta5 <- estimate_parameters(X5, y)

# Display the estimated parameters
# The output shows the coefficient values for bias, x4, x1², and x3²
cat("\nModel 5 parameters:\n")
print(theta5)
Model 5 parameters:
                  [,1]
bias       -754.486865
x4          186.436251
x1_squared   -4.728874
x3_squared   -2.599210
                  [,1]
bias       -754.486865
x4          186.436251
x1_squared   -4.728874
x3_squared   -2.599210
In [62]:
# Create a summary table for all model parameters:
parameter_summary <- data.frame(
  Model = c("Model-1", "Model-2", "Model-3", "Model-4", "Model-5"),
  Parameters = c(
    paste("bias =", round(theta1[1], 4), ", θ₁(x4) =", round(theta1[2], 4), ", θ₂(x3²) =", round(theta1[3], 4)),
    paste("bias =", round(theta2[1], 4), ", θ₁(x4) =", round(theta2[2], 4), ", θ₂(x3²) =", round(theta2[3], 4), ", θ₃(x5) =", round(theta2[4], 4)),
    paste("θ₁(x3) =", round(theta3[1], 4), ", θ₂(x4) =", round(theta3[2], 4), ", θ₃(x5³) =", round(theta3[3], 4)),
    paste("bias =", round(theta4[1], 4), ", θ₁(x4) =", round(theta4[2], 4), ", θ₂(x3²) =", round(theta4[3], 4), ", θ₃(x5³) =", round(theta4[4], 4)),
    paste("bias =", round(theta5[1], 4), ", θ₁(x4) =", round(theta5[2], 4), ", θ₂(x1²) =", round(theta5[3], 4), ", θ₃(x3²) =", round(theta5[4], 4))
  ),
  Formula = c(
    "y = θ₁x₄ + θ₂x₃² + θ_bias",
    "y = θ₁x₄ + θ₂x₃² + θ₃x₅ + θ_bias",
    "y = θ₁x₃ + θ₂x₄ + θ₃x₅³",
    "y = θ₁x₄ + θ₂x₃² + θ₃x₅³ + θ_bias",
    "y = θ₁x₄ + θ₂x₁² + θ₃x₃² + θ_bias"
  )
)

# Print the summary table
parameter_summary
A data.frame: 5 × 3
ModelParametersFormula
<chr><chr><chr>
Model-1bias = -3262.5768 , θ₁(x4) = 553.6435 , θ₂(x3²) = -7.2694 y = θ₁x₄ + θ₂x₃² + θ_bias
Model-2bias = -3372.8393 , θ₁(x4) = 562.4451 , θ₂(x3²) = -6.8916 , θ₃(x5) = 10.153 y = θ₁x₄ + θ₂x₃² + θ₃x₅ + θ_bias
Model-3θ₁(x3) = -59.7398 , θ₂(x4) = 97.7646 , θ₃(x5³) = 0.187 y = θ₁x₃ + θ₂x₄ + θ₃x₅³
Model-4bias = -3359.4433 , θ₁(x4) = 564.5115 , θ₂(x3²) = -6.8781 , θ₃(x5³) = 0.1968y = θ₁x₄ + θ₂x₃² + θ₃x₅³ + θ_bias
Model-5bias = -754.4869 , θ₁(x4) = 186.4363 , θ₂(x1²) = -4.7289 , θ₃(x3²) = -2.5992y = θ₁x₄ + θ₂x₁² + θ₃x₃² + θ_bias
In [63]:
# also store the parameters in list for later use
model_parameters <- list(
  model1 = theta1,
  model2 = theta2,
  model3 = theta3,
  model4 = theta4,
  model5 = theta5
)
model_parameters
$model1
A matrix: 3 × 1 of type dbl
bias-3262.576790
x4 553.643467
x3_squared -7.269425
$model2
A matrix: 4 × 1 of type dbl
bias-3372.839311
x4 562.445136
x3_squared -6.891603
x5 10.153047
$model3
A matrix: 3 × 1 of type dbl
x3-59.7397694
x4 97.7646009
x5_cubed 0.1869576
$model4
A matrix: 4 × 1 of type dbl
bias-3359.443301
x4 564.511467
x3_squared -6.878110
x5_cubed 0.196818
$model5
A matrix: 4 × 1 of type dbl
bias-754.486865
x4 186.436251
x1_squared -4.728874
x3_squared -2.599210

Task 2.2: Calculating Residual Sum of Squared Errors (RSS):¶

For Models 1, 2, 3, 4, and 5, we need to calculate the Residual Sum of Squares (RSS) to evaluate how well each model fits the data.

In [64]:
# Function to calculate Residual Sum of Squares (RSS);
#
# This function computes the RSS, which measures the discrepancy between
# the observed values and the values predicted by the model.
# RSS = Σ(y_i - ŷ_i)² where ŷ_i = x_i θ̂
#
#' @param X The design matrix with predictor variables
#' @param y The vector of observed target values
#' @param theta The vector of estimated parameters
#' @return The residual sum of squares (RSS)
calculate_rss <- function(X, y, theta) {
  # Calculate predicted values (ŷ) using the model
  # ŷ = X θ̂ (matrix multiplication of design matrix and parameters)
  y_pred <- X %*% theta
  
  # Calculate residuals (errors)
  # residuals = y - ŷ (difference between observed and predicted values)
  residuals <- y - y_pred
  
  # Square each residual
  # squared_residuals = residuals²
  squared_residuals <- residuals^2
  
  # Sum all squared residuals to get RSS
  # RSS = Σ(residuals²)
  rss <- sum(squared_residuals)
  
  # Return the calculated RSS value
  return(rss)
}

Calculate RSS for each model:

In [65]:
# Calculate RSS for Model 1: y = θ₁x₄ + θ₂x₃² + θ_bias: 

rss1 <- calculate_rss(X1, y, theta1)
cat("RSS for Model 1:", rss1)
RSS for Model 1: 558810.1
In [66]:
# Calculate RSS for Model 2: y = θ₁x₄ + θ₂x₃² + θ₃x₅ + θ_bias: 

rss2 <- calculate_rss(X2, y, theta2)
cat("RSS for Model 2:", rss2)
RSS for Model 2: 516315.9
In [67]:
# Calculate RSS for Model 3: y = θ₁x₃ + θ₂x₄ + θ₃x₅³

rss3 <- calculate_rss(X3, y, theta3)
cat("RSS for Model 3:", rss3 )
RSS for Model 3: 568957.4
In [68]:
# Calculate RSS for Model 4: y = θ₁x₄ + θ₂x₃² + θ₃x₅³ + θ_bias

rss4 <- calculate_rss(X4, y, theta4)
cat("RSS for Model 4:", rss4)
RSS for Model 4: 515095
In [69]:
# Calculate RSS for Model 5: y = θ₁x₄ + θ₂x₁² + θ₃x₃² + θ_bias

rss5 <- calculate_rss(X5, y, theta5)
cat("RSS for Model 5:", rss5)
RSS for Model 5: 198462.5
In [70]:
# Store RSS values for comparison; 

rss_values <- c(
  Model1 = rss1,
  Model2 = rss2,
  Model3 = rss3,
  Model4 = rss4,
  Model5 = rss5
)
In [71]:
# Create a comparison table
rss_comparison <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"),
  RSS = c(rss1, rss2, rss3, rss4, rss5)
)

# Print the comparison table
rss_comparison
A data.frame: 5 × 2
ModelRSS
<chr><dbl>
Model 1558810.1
Model 2516315.9
Model 3568957.4
Model 4515095.0
Model 5198462.5
In [72]:
# Plotting the RSS values using bar plots: 

# Prepare data frame for plotting
df <- data.frame(
  Model = names(rss_values),
  RSS = as.numeric(rss_values)
)

# Create bar plot comparing RSS values across models
model_comparison_plot <- ggplot(df, aes(x = reorder(Model, RSS), y = RSS, fill = Model)) +
  
  # Bars and labels
  geom_bar(stat = "identity", width = 0.7, alpha = 0.8) +
  geom_text(aes(label = round(RSS, 2)), vjust = -0.5, size = 3.5, fontface = "bold") +
  
  # Labels and titles
  labs(
    title = "Model Performance Comparison:",
    subtitle = "Residual Sum of Squares (RSS) Evaluation\nLower values indicate better fit:",
    x = "Regression Models",
    y = "RSS Value",
  ) +
  
  # Theme and formatting
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 19),
    plot.subtitle = element_text(hjust = 0.5, size = 15),
    axis.text.x = element_text(angle = 0, hjust = 1, vjust = 1, size =15),
    axis.title.x = element_text(margin = margin(t = 10)),
    legend.position = "none",
    plot.margin = margin(1, 1, 1, 1, unit = "cm")
  ) +
  
  # Y-axis buffer and color scale
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  scale_fill_viridis_d(option = "D", begin = 0.3, end = 0.9)

# Set plot size using repr
# width and height are in inches
options(jupyter.plot_mimetypes = "image/png") 
options(repr.plot.width = 20,   # Set default plot width
        repr.plot.height = 8)  # Set default plot height

# Show the plot
model_comparison_plot
No description has been provided for this image

Task 2.3: Calculate log-likelihood for each model:¶

In [73]:
# Calculate log-likelihood for each model

# The log-likelihood function is:
# ln p(D|θ̂) = -n/2 ln(2π) - n/2 ln(σ̂²) - RSS/(2σ̂²)
# Where σ̂² = RSS/(n-1) is the estimated variance

calculate_log_likelihood <- function(rss, n) {
  # Calculate the estimated variance (σ̂²)
  sigma_squared <- rss / (n - 1)
  
  # Calculate the log-likelihood using the formula
  log_likelihood <- -n/2 * log(2*pi) - n/2 * log(sigma_squared) - rss/(2*sigma_squared)
  
  return(log_likelihood)
}
In [74]:
# Get the number of observations (n)
n <- nrow(data)
cat("Number of observations:", n, "\n\n")
Number of observations: 9527 

In [75]:
# Calculate log-likelihood for each model
# For each model, we use the RSS calculated earlier

ll1 <- calculate_log_likelihood(rss1, n)
ll2 <- calculate_log_likelihood(rss2, n)
ll3 <- calculate_log_likelihood(rss3, n)
ll4 <- calculate_log_likelihood(rss4, n)
ll5 <- calculate_log_likelihood(rss5, n)
In [76]:
# Print the log-likelihood values: 

cat("Log-likelihood values:\n")
cat("Model 1:", ll1, "\n")
cat("Model 2:", ll2, "\n")
cat("Model 3:", ll3, "\n")
cat("Model 4:", ll4, "\n")
cat("Model 5:", ll5)
Log-likelihood values:
Model 1: -32913.67 
Model 2: -32536.92 
Model 3: -32999.4 
Model 2: -32536.92 
Model 3: -32999.4 
Model 4: -32525.65 
Model 5: -27982.45
In [77]:
# Store log-likelihood values

ll_values <- c(
  Model1 = ll1,
  Model2 = ll2,
  Model3 = ll3,
  Model4 = ll4,
  Model5 = ll5
)
In [78]:
# Create comparison table of RSS and log-likelihoods: 

ll_comparison <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"),
  RSS = c(rss1, rss2, rss3, rss4, rss5),
  LogLikelihood = c(ll1, ll2, ll3, ll4, ll5)
)

# Print the log-likelihood values with RSS: 
ll_comparison
A data.frame: 5 × 3
ModelRSSLogLikelihood
<chr><dbl><dbl>
Model 1558810.1-32913.67
Model 2516315.9-32536.92
Model 3568957.4-32999.40
Model 4515095.0-32525.65
Model 5198462.5-27982.45
  • Higher log-likelihood values indicate better model fit.
  • Log-likelihood increases as RSS decreases.
  • The log-likelihood will be used in the next task to calculate AIC and BIC.
  • Unlike RSS, log-likelihood incorporates the number of data points, making it more comparable across different datasets
In [79]:
# create dataframe and only include rss values: 
rss_df <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"),
  RSS = c(rss1, rss2, rss3, rss4, rss5)
)
rss_df
A data.frame: 5 × 2
ModelRSS
<chr><dbl>
Model 1558810.1
Model 2516315.9
Model 3568957.4
Model 4515095.0
Model 5198462.5

Task 2.4: Calculate AIC and BIC for each model:¶

In [80]:
# Calculate AIC and BIC for each model:

# AIC = 2k - 2ln p(D|θ̂)
# BIC = k·ln(n) - 2ln p(D|θ̂)
# Where:
# - k is the number of parameters in the model
# - n is the number of observations
# - ln p(D|θ̂) is the log-likelihood function

# First, determine the number of parameters (k) for each model
# k includes all coefficients plus the variance parameter
k1 <- ncol(X1)  # Model 1: bias, x4, x3²
k2 <- ncol(X2)  # Model 2: bias, x4, x3², x5
k3 <- ncol(X3)  # Model 3: x3, x4, x5³ (no bias)
k4 <- ncol(X4)  # Model 4: bias, x4, x3², x5³
k5 <- ncol(X5)  # Model 5: bias, x4, x1², x3²
In [81]:
k1 # k-value for model 1
3
In [82]:
# Function to calculate AIC and BIC:

calculate_aic_bic <- function(log_likelihood, k, n) {
  # Calculate AIC = 2k - 2ln p(D|θ̂)
  aic <- 2*k - 2*log_likelihood
  
  # Calculate BIC = k·ln(n) - 2ln p(D|θ̂)
  bic <- k*log(n) - 2*log_likelihood
  
  return(list(aic = aic, bic = bic))
}
In [83]:
# Calculate AIC and BIC for each model

aic_bic1 <- calculate_aic_bic(ll1, k1, n)
aic_bic2 <- calculate_aic_bic(ll2, k2, n)
aic_bic3 <- calculate_aic_bic(ll3, k3, n)
aic_bic4 <- calculate_aic_bic(ll4, k4, n)
aic_bic5 <- calculate_aic_bic(ll5, k5, n)
In [84]:
aic_bic1 # An example to see how they look
$aic
65833.347964127
$bic
65854.8336195821
In [85]:
# Create a comparison table with AIC, BIC,
# Log-Likelihood, RSS, number of parameters, and model names:

aic_bic_comparison <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"),
  Parameters = c(k1, k2, k3, k4, k5),
  RSS = c(rss1, rss2, rss3, rss4, rss5),
  LogLikelihood = c(ll1, ll2, ll3, ll4, ll5),
  AIC = c(aic_bic1$aic, aic_bic2$aic, aic_bic3$aic, aic_bic4$aic, aic_bic5$aic),
  BIC = c(aic_bic1$bic, aic_bic2$bic, aic_bic3$bic, aic_bic4$bic, aic_bic5$bic)
)

# Print the comparison table: 
cat("Model Comparison Table:\n")
aic_bic_comparison
Model Comparison Table:
A data.frame: 5 × 6
ModelParametersRSSLogLikelihoodAICBIC
<chr><int><dbl><dbl><dbl><dbl>
Model 13558810.1-32913.6765833.3565854.83
Model 24516315.9-32536.9265081.8565110.50
Model 33568957.4-32999.4066004.7966026.28
Model 44515095.0-32525.6565059.2965087.94
Model 54198462.5-27982.4555972.9156001.55
In [86]:
# Identify the better performing model according to 
# each criterion:AIC, BIC, RSS and Log-Likelihood

best_rss <- aic_bic_comparison$Model[which.min(aic_bic_comparison$RSS)] # low RSS is better
best_aic <- aic_bic_comparison$Model[which.min(aic_bic_comparison$AIC)] # low AIC is better
best_bic <- aic_bic_comparison$Model[which.min(aic_bic_comparison$BIC)] # low BIC is better

cat("\nBetter performing model according to RSS:", best_rss, "\n")
cat("Better performing model according to AIC:", best_aic, "\n")
cat("Better performing model according to BIC:", best_bic, "\n")
Better performing model according to RSS: Model 5 
Better performing model according to AIC: Model 5 
Better performing model according to BIC: Model 5 
Better performing model according to AIC: Model 5 
Better performing model according to BIC: Model 5 

So, based on the analysis we have conducted, model 5 shows better performance compared to other models if we compare them based only on RSS, AIC and BIC. Exploration is not done yet so let's explore further.

In [87]:
# Store the results for later use
aic_values <- c(
  Model1 = aic_bic1$aic,
  Model2 = aic_bic2$aic,
  Model3 = aic_bic3$aic,
  Model4 = aic_bic4$aic,
  Model5 = aic_bic5$aic
)
In [88]:
bic_values <- c(
  Model1 = aic_bic1$bic,
  Model2 = aic_bic2$bic,
  Model3 = aic_bic3$bic,
  Model4 = aic_bic4$bic,
  Model5 = aic_bic5$bic
)
  • Lower AIC and BIC values indicate better models

  • BIC penalizes model complexity more heavily than AIC

  • Both metrics balance goodness of fit (log-likelihood) against model complexity (k)

  • The best model according to these criteria may not be the same as the model with the lowest RSS

In [89]:
# cretae a dataframe for aic and bic values
aic_bic_df <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"),
  AIC = c(aic_bic1$aic, aic_bic2$aic, aic_bic3$aic, aic_bic4$aic, aic_bic5$aic),
  BIC = c(aic_bic1$bic, aic_bic2$bic, aic_bic3$bic, aic_bic4$bic, aic_bic5$bic)
)
aic_bic_df
A data.frame: 5 × 3
ModelAICBIC
<chr><dbl><dbl>
Model 165833.3565854.83
Model 265081.8565110.50
Model 366004.7966026.28
Model 465059.2965087.94
Model 555972.9156001.55
R-squared($R^2$), Adjusted R-squared ($R^2_{\text{adj}}$), Root mean squared error ($RMSE$) and Mean percentage error ($MPE$):¶

R-squared also called the coefficient of determination is the measure of how much of the variance in the actual data is explained by the model. Its values ranges from 0 to 1. If we get the value 0 or close to it then this means the model explains nothing and if we get 1 or close to this value then it indicates a perfect fit.

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$

Where:

  • $SS_{\text{res}}$ = sum of squared residuals (errors)
  • $SS_{\text{tot}}$ = total sum of squares (variation in actual data)
In [90]:
# RSS values calculated above:
rss1 
rss2
rss3
rss4
rss5 
558810.063026389
516315.864810877
568957.372863864
515094.978374438
198462.455464457
In [91]:
# R-squared: Proportion of variance explained by the model
r_squared <- function(y, y_pred) {
  ss_res <- sum((y - y_pred)^2)           # Residual sum of squares
  ss_tot <- sum((y - mean(y))^2)          # Total sum of squares
  1 - ss_res / ss_tot                     # R-squared formula
}

Adjusted R-squared ie. adjusted for number of predictions is similar to r-squared but it panelizes models with too many variables. This is suitable to compare different models with different number of predictors. This value increase only when adding a new feature(s) improves the model significantly.

$$R^2_{\text{adj}} = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - p - 1} \right)$$

Where:

  • $n$ = number of observations
  • $p$ = number of predictors (independent variables)
In [92]:
# Adjusted R-squared: Adjusts R^2 for number of predictors (p) and observations (n)
# p = number of predictors (excluding intercept)
adjusted_r_squared <- function(y, y_pred, p) {
  n <- length(y)                          # Number of observations
  r2 <- r_squared(y, y_pred)              # R-squared
  1 - (1 - r2) * ((n - 1) / (n - p - 1))  # Adjusted R-squared formula
}
# actual is the vector of observed values (e.g., test or train y).
# predicted is the vector of predicted values from the model.
# p is the number of predictors (not including the intercept).

Root mean squared error(RMSE) also called average size of the error is the avrage size of the prediction errors. Lower the values, better the model is. One negative side of this is that it is sensitive to large errors because it squares the error terms.

$$RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 }$$

Where:

  • $y_i$ = actual value
  • $\hat{y}_i$ = predicted value
  • $n$ = number of data points
In [93]:
# Root Mean Squared Error (RMSE): Average magnitude of prediction error
# Function to calculate Root Mean Squared Error (RMSE)
rmse <- function(y, y_pred) {
  sqrt(mean((y - y_pred)^2))              # RMSE formula
}

Mean percentage error (MPE) also called average % difference tells on average how far off the model is, in percentage terms. Can represent the overprediction (ie. negative) or underprediction (ie. positive).

$$MPE = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right) \times 100$$

Where:

  • Positive MPE → model tends to underpredict
  • Negative MPE → model tends to overpredict
  • $y_i \neq 0$ is important to avoid division errors
In [94]:
# Mean Percentage Error (MPE): Average percentage error (can be negative)
# Function to calculate Mean Percentage Error (MPE)
mpe <- function(y, y_pred) {
  mean((y - y_pred) / y) * 100            # MPE formula (in percent)
}
In [95]:
# Function to compute all metrics for a list of models
evaluate_models <- function(y, model_list) {
  results <- data.frame(
    Model = character(),
    R2 = numeric(),
    Adj_R2 = numeric(),
    RMSE = numeric(),
    MPE = numeric(),
    stringsAsFactors = FALSE
  )
  for (i in seq_along(model_list)) {
    model <- model_list[[i]]
    X <- model$X
    theta <- model$theta
    y_pred <- X %*% theta
    p <- ncol(X) - as.integer(colnames(X)[1] == "bias") # exclude intercept if present
    results <- rbind(results, data.frame(
      Model = paste0("Model ", i),
      R2 = r_squared(y, y_pred),
      Adj_R2 = adjusted_r_squared(y, y_pred, p),
      RMSE = rmse(y, y_pred),
      MPE = mpe(y, y_pred)
    ))
  }
  return(results)
}

# Prepare your models as a list
model_list <- list(
  list(X = X1, theta = theta1),
  list(X = X2, theta = theta2),
  list(X = X3, theta = theta3),
  list(X = X4, theta = theta4),
  list(X = X5, theta = theta5)
)

# Call the function and print the summary table
metrics_summary <- evaluate_models(y, model_list)
print(metrics_summary)
    Model        R2    Adj_R2     RMSE         MPE
1 Model 1 0.7979488 0.7979064 7.658682 -0.02753546
2 Model 2 0.8133136 0.8132548 7.361726 -0.02538051
3 Model 3 0.7942798 0.7942150 7.727905 -0.02896322
4 Model 4 0.8137551 0.8136964 7.353017 -0.02532178
5 Model 5 0.9282411 0.9282185 4.564162 -0.01009492

Task 2.5: Analyzing Residual Distributions:¶

$Residuals$ = $Actual$ $values$ - $Predicted$ $values$

Residuals are the differences between observed values and model predictions. In regression analysis, we assume these residuals follow a normal distribution with mean zero. This assumption is crucial because:

  • It validates the statistical tests used in regression

  • It confirms that our model captures the systematic patterns in the data

  • It ensures prediction intervals are accurate

  • It indicates whether our model form is appropriate

Let's first calculate and extract the residuals for each model:

In [96]:
# Analysis of the Distribution of Residuals:

# This analysis helps verify if model assumptions are met, particularly:
# - Normality of residuals (should follow a normal distribution)
# - Homoscedasticity (constant variance across predictions)
# - Independence (no patterns in residuals)

# Calculate residuals for each model: 
# Residuals = Actual values - Predicted values
residuals1 <- y - X1 %*% theta1
residuals2 <- y - X2 %*% theta2
residuals3 <- y - X3 %*% theta3
residuals4 <- y - X4 %*% theta4
residuals5 <- y - X5 %*% theta5

# Store residuals in a data frame for easier analysis
residuals_df <- data.frame(
  Model1 = as.vector(residuals1),
  Model2 = as.vector(residuals2),
  Model3 = as.vector(residuals3),
  Model4 = as.vector(residuals4),
  Model5 = as.vector(residuals5)
)

# Preview the first few residuals
cat("First few residuals for each model:\n")
head(residuals_df)
First few residuals for each model:
A data.frame: 6 × 5
Model1Model2Model3Model4Model5
<dbl><dbl><dbl><dbl><dbl>
112.60036911.085685 8.56825710.961994 2.1117450
2-2.033020-2.651892-2.554337-2.564005 0.6075230
3-8.316513-3.014048-5.589774-3.251239 0.9377150
4-1.852457-2.291955-4.909931-2.208376-0.8681405
5-7.029623-9.371600-9.137398-9.662198-7.6155940
6-1.926094-2.744397-3.166952-2.773088 2.4674247
In [97]:
dim(residuals_df)
  1. 9527
  2. 5
In [98]:
# Calculate basic statistics for residuals:

# Function to calculate residual statistics: 
get_residual_stats <- function(residuals) {
  data.frame(
    Mean = mean(residuals),                # this should be close to zero
    Median = median(residuals),            # this should be close to zero
    SD = sd(residuals),                    # this is the measure of spread
    Skewness = skewness(residuals),        # this should be close to 0 for normal distribution
    Kurtosis = kurtosis(residuals),        # this should be close to 3 for normal distribution
    Min = min(residuals),                  # the minimum residual for the model 
    Max = max(residuals),                  # maximum residual for model
    Range = max(residuals) - min(residuals) # range of residuals: (max - min)
  )
}
In [99]:
# Calculate statistics for each model: 

stats1 <- get_residual_stats(residuals1)
stats2 <- get_residual_stats(residuals2)
stats3 <- get_residual_stats(residuals3)
stats4 <- get_residual_stats(residuals4)
stats5 <- get_residual_stats(residuals5)
In [100]:
# Combine all statistics into a single table for better comparision: 

residual_stats <- rbind(stats1, stats2, stats3, stats4, stats5)
rownames(residual_stats) <- c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5")

# Print the statistics: 
cat("\nResidual Statistics:\n")
residual_stats
Residual Statistics:
A data.frame: 5 × 8
MeanMedianSDSkewnessKurtosisMinMaxRange
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
Model 1 7.687799e-09-0.285570727.659084 0.141748373.783532-37.7839734.9195272.70349
Model 2-1.829808e-09-0.263636217.362112 0.184323953.823095-37.2170932.0673669.28444
Model 3-1.924974e-03-0.048356237.728311-0.048525343.800702-37.7820735.7261973.50826
Model 4-3.562977e-08-0.252792067.353403 0.186486943.825071-37.5054731.9361169.44158
Model 5-4.855548e-09-0.016585364.564402-0.479302556.549272-46.6379716.4103963.04836

Normality Tests for the Distribution of Residuals:¶

Histograms with Density + Rug plots:

In [101]:
# hide warnings;
options(warn = -1)

# Load necessary libraries
library(ggplot2)
library(reshape2)

# Convert wide-format residuals data frame to long-format for ggplot
# This makes plotting multiple models easier using 'facet_wrap'
residuals_long <- melt(
  residuals_df,
  variable.name = "Model",    # Name for the column containing model names
  value.name = "Residual"     # Name for the column containing residuals
)

# Create histogram plot with density overlay and rug plot for each model
hist_plot <- ggplot(residuals_long, aes(x = Residual)) +
  
  # Plot histogram of residuals with density scaling
  geom_histogram(
    aes(y = ..density..),     # Use density instead of raw counts
    bins = 30,                # Number of bins in histogram
    fill = "skyblue",         # Bar fill color
    color = "white",          # Border color of bars
    alpha = 0.7               # Transparency for visual softness
  ) +
  
  # Add actual density curve (smooth estimate of distribution)
  geom_density(
    color = "darkblue",       # Line color for density
    linewidth = 1             # Line thickness
  ) +
  
  # Add rug plot for residuals (shows individual values)
  geom_rug(
    aes(x = Residual), 
    sides = "b", 
    color = "black", 
    alpha = 0.3, 
    length = unit(0.08, "npc")
  ) +
  
  # Facet the plot by each model so each gets its own histogram
  facet_wrap(~ Model, scales = "free_y") +
  
  # Add informative titles and axis labels
  labs(
    title = "Residual Distributions Across Models",
    subtitle = "Histogram with Density Curve (Blue) and Rug Plot (Black)",
    x = "Residual Value",
    y = "Density"
  ) +
  
  # Use a clean minimal theme for clarity
  theme_minimal() +
  theme(
    strip.text = element_text(size = 12, face = "bold"),     # Model names
    plot.title = element_text(hjust = 0.5, face = "bold"),   # Centered bold title
    plot.subtitle = element_text(hjust = 0.5)                # Centered subtitle
  )

# Show the plot
print(hist_plot)
# Reset warnings settings:
options(warn = 0)
No id variables; using all as measure variables

No description has been provided for this image

QQ-Plots for Normality Check:

Q-Q plot (Quantile-Quantile plot) compares the quantiles of the residuals (sample quantiles) to the quantiles of a theoretical distribution (commonly the standard normal distribution). If the residuals are normally distributed, the points should fall approximately along a straight 45-degree reference line. Deviation from this line, especially in the tails, suggests non-normality.

In [102]:
# Load required libraries
library(ggplot2)      # For creating the Q-Q plot
library(gridExtra)    # For arranging multiple plots (if needed)

#' Create a Quantile-Quantile (Q-Q) Plot Residuals
#' 
#'Creates a Q-Q plot against the standard normal distribution.
#'
#' @param residuals Numeric vector of model residuals (must be clean)
#' @param model_name Character string for plot title (default: "Model")
#' @param point_color Color for Q-Q points (default: "steelblue")
#' @param point_alpha Transparency for points (default: 0.7)
#' @param point_size Size of points (default: 2)
#' @param line_color Color for reference line (default: "red")
#' @param line_type Linetype for reference line (default: "dashed")
#' @param line_width Width of reference line (default: 0.8)
#' 
#' @return A ggplot object containing the Q-Q plot
create_qq_plot <- function(residuals, 
                                model_name = "Model", 
                                point_color = "steelblue",
                                point_alpha = 0.7,
                                point_size = 2,
                                line_color = "red",
                                line_type = "dashed",
                                line_width = 0.8) {

  # Calculate Theoretical and Sample Quantiles
  # Get number of observations
  n <- length(residuals)
  
  # Calculate probability points using Blom's method
  # (1:n - 0.5)/n gives unbiased quantile estimates for normal distribution
  p <- (1:n - 0.5) / n
  
  # Get theoretical quantiles from standard normal distribution
  theoretical_q <- qnorm(p)
  
  # Get empirical quantiles by sorting the residuals
  empirical_q <- sort(residuals)
  
  # Prepare Data for Plotting

  # Create data frame for ggplot
  qq_data <- data.frame(
    Theoretical = theoretical_q,  # x-axis values
    Empirical = empirical_q       # y-axis values
  )
  
  # Calculate parameters for reference line
  # Using mean and sd of residuals makes the line represent
  # what we would expect if residuals were perfectly normal
  ref_intercept <- mean(residuals)
  ref_slope <- sd(residuals)
  
  # Create the Q-Q Plot: 
  
  # Initialize ggplot with our data
  qq_plot <- ggplot(qq_data, aes(x = Theoretical, y = Empirical)) +
    
    # Add the points for the Q-Q plot
    geom_point(
      color = point_color,      # Point color from parameters
      alpha = point_alpha,      # Transparency from parameters
      size = point_size,        # Size from parameters
      shape = 16                # Solid circle shape (16 is ggplot's default)
    ) +
    
    # Add the reference line
    # This line shows what perfect normality would look like
    geom_abline(
      intercept = ref_intercept,  # Line intercept (mean of residuals)
      slope = ref_slope,          # Line slope (sd of residuals)
      color = line_color,         # Line color from parameters
      linetype = line_type,       # Line type from parameters
      linewidth = line_width      # Line width from parameters
    ) +
    
    # Add labels and title
    labs(
      title = paste("Q-Q Plot for", model_name),  # Dynamic title with model name
      x = "Theoretical Quantiles (Standard Normal)",  # x-axis label
      y = "Sample Quantiles"                     # y-axis label
    ) +
    
    # Apply minimal theme (clean background)
    theme_minimal() +
    
    # Customize theme elements
    theme(
      plot.title = element_text(
        hjust = 0.5,            # Center the title
        face = "bold",          # Bold font
        size = 12,             # Slightly larger size
        margin = margin(b = 10) # Add space below title
      ),
      panel.grid.minor = element_blank(),  # Remove minor grid lines
      panel.grid.major = element_line(color = "grey90"),  # Light grid lines
      axis.line = element_line(color = "black")  # Add axis lines
    )
  
  # Return the plot object
  return(qq_plot)
}
# Example usage (assuming residuals1-residuals5 exist):
qq1 <- create_qq_plot(residuals1, "Model 1", point_color= "blue")
qq2 <- create_qq_plot(residuals2, "Model 2", point_color= "forestgreen")
qq3 <- create_qq_plot(residuals3, "Model 3", point_color="orange" )
qq4 <- create_qq_plot(residuals4, "Model 4", point_color= "brown")
qq5 <- create_qq_plot(residuals5, "Model 5", point_color= "purple")

# Arrange plots with proper layout
combined_qq <- grid.arrange(
   qq1, qq2, qq3, qq4, qq5,
   ncol = 2,
   top = "Comparing Q-Q Plots for Model Residuals:"
 )
No description has been provided for this image

We want to compare each model’s residuals against the standard normal distribution (mean 0, sd 1), so first we standardize the residuals for each model.

In [103]:
residuals1_std <- scale(residuals1)
residuals2_std <- scale(residuals2)
residuals3_std <- scale(residuals3)
residuals4_std <- scale(residuals4)
residuals5_std <- scale(residuals5)

Now, plot the qq-plots for the standardized residuals of all the models:

In [104]:
# Load required libraries
library(ggplot2)    # For creating plots
library(gridExtra)  # For arranging multiple plots in a grid

# Function to create a Q-Q plot for standardized residuals
create_qq_plot <- function(residuals_std, model_name, point_color) {

  # Get the number of residuals
  n <- length(residuals_std)

  # Compute the percentiles for the residuals
  # These are used to get the corresponding theoretical quantiles
  p <- (1:n - 0.5) / n

  # Compute the theoretical quantiles from the standard normal distribution
  theoretical_q <- qnorm(p)

  # Sort the standardized residuals to get empirical quantiles
  empirical_q <- sort(residuals_std)

  # Create a data frame with both quantiles for plotting
  qq_data <- data.frame(
    Theoretical = theoretical_q,
    Empirical = empirical_q
  )

  # Create the Q-Q plot
  ggplot(qq_data, aes(x = Theoretical, y = Empirical)) +
    geom_point(color = point_color, alpha = 0.6) +  # Add the points (empirical vs theoretical)
    
    # Add a 45-degree reference line (y = x), indicating perfect normality
    geom_abline(intercept = 0, slope = 1, color = "red", 
                linetype = "dashed", linewidth = 1) +

    # Add titles and axis labels
    labs(
      title = paste("Q-Q Plot for", model_name),
      x = "Theoretical Quantiles (N(0,1))",
      y = "Standardized Residual Quantiles"
    ) +

    # Use a minimal theme for a clean look
    theme_minimal() +

    # Customize the plot title appearance
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold", size = 12)
    )
}

# Create Q-Q plots for each model using their standardized residuals and a unique color
qq1 <- create_qq_plot(residuals1_std, "Model 1", "blue")
qq2 <- create_qq_plot(residuals2_std, "Model 2", "forestgreen")
qq3 <- create_qq_plot(residuals3_std, "Model 3", "purple")
qq4 <- create_qq_plot(residuals4_std, "Model 4", "orange")
qq5 <- create_qq_plot(residuals5_std, "Model 5", "brown")

# Display all five plots in a 2-column grid layout
# This helps visually compare the residual normality across models
grid.arrange(qq1, qq2, qq3, qq4, qq5, ncol = 2)
No description has been provided for this image
Proportion of Residuals within different Standard Deviation Ranges:¶
In [105]:
# Calculate proportion of residuals within different standard deviation ranges

# For a normal distribution:
# - ~68% of values should be within 1 standard deviation of the mean
# - ~95% of values should be within 2 standard deviations
# - ~99.7% of values should be within 3 standard deviations

# Function to calculate normality measures
calculate_normality_measures <- function(residuals, model_name) {
  # Calculate basic statistics
  mean_val <- mean(residuals)
  median_val <- median(residuals)
  sd_val <- sd(residuals)
  
  # Calculate standardized residuals
  std_residuals <- (residuals - mean_val) / sd_val
  
  # Calculate proportion within different SD ranges
  within_1sd <- mean(abs(std_residuals) <= 1) * 100
  within_2sd <- mean(abs(std_residuals) <= 2) * 100
  within_3sd <- mean(abs(std_residuals) <= 3) * 100
  
  # Expected proportions for normal distribution
  expected_1sd <- 68.27
  expected_2sd <- 95.45
  expected_3sd <- 99.73
  
  # Create results data frame
  data.frame(
    Model = model_name,
    Within1SD = sprintf("%.2f%% (Exp: 68.27%%)", within_1sd),
    Within2SD = sprintf("%.2f%% (Exp: 95.45%%)", within_2sd),
    Within3SD = sprintf("%.2f%% (Exp: 99.73%%)", within_3sd),
    Skewness = round(skewness(residuals), 4),
    ExcessKurtosis = round(kurtosis(residuals) - 3, 4)
  )
}

# Calculate measures for each model
measures1 <- calculate_normality_measures(residuals1, "Model 1")
measures2 <- calculate_normality_measures(residuals2, "Model 2")
measures3 <- calculate_normality_measures(residuals3, "Model 3")
measures4 <- calculate_normality_measures(residuals4, "Model 4")
measures5 <- calculate_normality_measures(residuals5, "Model 5")

# Combine results
normality_measures <- rbind(measures1, measures2, measures3, measures4, measures5)

# Print results
cat("\nNormality Distribution Measures:\n")
normality_measures
Normality Distribution Measures:
A data.frame: 5 × 6
ModelWithin1SDWithin2SDWithin3SDSkewnessExcessKurtosis
<chr><chr><chr><chr><dbl><dbl>
Model 170.81% (Exp: 68.27%)94.91% (Exp: 95.45%)99.34% (Exp: 99.73%) 0.14170.7835
Model 270.69% (Exp: 68.27%)95.17% (Exp: 95.45%)99.28% (Exp: 99.73%) 0.18430.8231
Model 370.83% (Exp: 68.27%)94.77% (Exp: 95.45%)99.38% (Exp: 99.73%)-0.04850.8007
Model 470.65% (Exp: 68.27%)95.14% (Exp: 95.45%)99.24% (Exp: 99.73%) 0.18650.8251
Model 569.00% (Exp: 68.27%)96.56% (Exp: 95.45%)99.55% (Exp: 99.73%)-0.47933.5493
Residuals vs. Fitted Values:¶

1. Observed Value:
The observed value is the actual value that was recorded in the dataset. In regression, it's denoted as $y$.

2. Predicted Value:
The predicted value is the value estimated by our regression model based on the input features. Denoted as $\hat{y}$.

3. Residual:
The residual is the difference between the observed and predicted value:

$$\text{Residual} = y - \hat{y}$$

  • It tells us how wrong the prediction was.
  • A positive residual means the model underestimated.
  • A negative residual means the model overestimated.

Goal in a good model: Residuals should be randomly scattered around 0 with no visible pattern.

4. Non-linearity:
A situation where the relationship between input variables and output is not linear (i.e., not a straight line).

5. Assumption in Linear Model:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \epsilon$$

But in real life, relationships can be curved, exponential, logarithmic, etc.

If the relationship is non-linear but we fit a linear model, the residuals will show patterns (like curves or waves) in the Residual vs. Fitted plot. That's a sign the model is misspecified (missing non-linear terms or interactions).

6. Heteroscedasticity:
Heteroscedasticity refers to a condition in regression where the variance of residuals is not constant across the range of fitted values. In simpler terms, as the predicted value increases or decreases, the spread (or variability) of the residuals also increases or decreases.

Let’s say we’re predicting Enorgy Output based on Temprature Levels. Now if our model:

  • Predicts low incomes accurately with small errors
  • But predicts high incomes with very large errors

Then we have heteroscedasticity — the residuals have larger variance as fitted values increase.

In a Residual vs Fitted plot:

  • Homoscedasticity: Points are evenly spread around 0, this is ideal.
  • Heteroscedasticity: Points look like a fan shape, cone, or funnel, spreading out as fitted values increase.

Residuals should have constant variance for all levels of the predictor variable(s).

7. Residuals are said to follow normality if:

  • Their distribution is symmetric and bell-shaped
  • Most residuals lie within:
    • 68.27% within ±1 SD
    • 95.45% within ±2 SD
    • 99.73% within ±3 SD
In [106]:
# Create residual plots (residuals vs. fitted values):

# These plots help check for homoscedasticity and patterns in residuals

# Function to create residual vs. fitted plot: 
create_residual_plot <- function(X, theta, residuals, model_name) {
  # Calculate fitted values
  fitted_values <- X %*% theta
  
  # Create data frame for plotting
  plot_data <- data.frame(
    Fitted = as.vector(fitted_values),
    Residual = as.vector(residuals)
  )
  
  # Create the plot
  ggplot(plot_data, aes(x = Fitted, y = Residual)) +
    geom_point(alpha = 0.5, color = "#4b4bb5") +
    geom_hline(yintercept = 0, linetype = "solid", color = "red") +
    geom_smooth(method = "loess", se = FALSE, color = "#06f706") +
    labs(title = paste("Residuals vs. Fitted Values for", model_name),
         x = "Fitted Values",
         y = "Residuals") +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold"),
      plot.subtitle = element_text(hjust = 0.5)
    )
}

# Create residual plots for each model
res_plot1 <- create_residual_plot(X1, theta1, residuals_df$Model1, "Model 1")
res_plot2 <- create_residual_plot(X2, theta2, residuals_df$Model2, "Model 2")
res_plot3 <- create_residual_plot(X3, theta3, residuals_df$Model3, "Model 3")
res_plot4 <- create_residual_plot(X4, theta4, residuals_df$Model4, "Model 4")
res_plot5 <- create_residual_plot(X5, theta5, residuals_df$Model5, "Model 5")

# Display the residual plots
grid.arrange(res_plot1, res_plot2, res_plot3, res_plot4, res_plot5, ncol = 2)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
No description has been provided for this image

Each plot shows the relation between residuals (errors) vs. fitted (predicted) values. Residuals are mostly centered around zero, which is good—this means predictions are generally unbiased.

Spread of residuals is fairly constant across fitted values (no strong funnel shape), so there’s no major issue with heteroscedasticity (unequal error variance). Slight downward trend in residuals at higher fitted values hints that the models may slightly underpredict at the high end.

For model 5, except some outliers, majority of points are clustered arround the reference line (red line) with constant variance, no curve or funnel shape of the points around the line. The LOESS (green line) is almost parallel and overlapping with the reference (red line); indicating model 5 is performing exceptionally well.

Task 2.6: Selecting the Best Regression Model:¶

Now, finally we can create a model selection table based on all the statistics and computations we have like: rss, ll, aic, bic, residual_stats, R2, Adj_R2, RMSE, MPE, normality_measures, etc.

In [107]:
metrics_summary # we have from above: 
A data.frame: 5 × 5
ModelR2Adj_R2RMSEMPE
<chr><dbl><dbl><dbl><dbl>
Model 10.79794880.79790647.658682-0.02753546
Model 20.81331360.81325487.361726-0.02538051
Model 30.79427980.79421507.727905-0.02896322
Model 40.81375510.81369647.353017-0.02532178
Model 50.92824110.92821854.564162-0.01009492
In [108]:
# Create the comparison table for comparision: 
comparison_table <- data.frame(
  Model_1 = c(
    round(rss_comparison$RSS[1], 2),
    round(ll_comparison$LogLikelihood[1], 2),
    round(aic_bic_comparison$AIC[1], 2),
    round(aic_bic_comparison$BIC[1], 2),
    round(metrics_summary$R2[1], 2),
    round(metrics_summary$Adj_R2[1], 2),
    round(metrics_summary$RMSE[1], 2),
    round(metrics_summary$MPE[1], 2), 
    round(residual_stats$Mean[1], 6),
    round(residual_stats$Median[1], 2),
    round(residual_stats$SD[1], 2),
    round(residual_stats$Skewness[1], 3),
    round(residual_stats$Kurtosis[1], 3),
    round(residual_stats$Min[1], 2),
    round(residual_stats$Max[1], 2),
    round(residual_stats$Range[1], 2),
    normality_measures$Within1SD[1],
    normality_measures$Within2SD[1],
    normality_measures$Within3SD[1],
    round(normality_measures$ExcessKurtosis[1], 3)
  ),
  Model_2 = c(
    round(rss_comparison$RSS[2], 2),
    round(ll_comparison$LogLikelihood[2], 2),
    round(aic_bic_comparison$AIC[2], 2),
    round(aic_bic_comparison$BIC[2], 2),
    round(metrics_summary$R2[2], 2),
    round(metrics_summary$Adj_R2[2], 2),
    round(metrics_summary$RMSE[2], 2),
    round(metrics_summary$MPE[2], 2),
    round(residual_stats$Mean[2], 6),
    round(residual_stats$Median[2], 2),
    round(residual_stats$SD[2], 2),
    round(residual_stats$Skewness[2], 3),
    round(residual_stats$Kurtosis[2], 3),
    round(residual_stats$Min[2], 2),
    round(residual_stats$Max[2], 2),
    round(residual_stats$Range[2], 2),
    normality_measures$Within1SD[2],
    normality_measures$Within2SD[2],
    normality_measures$Within3SD[2],
    round(normality_measures$ExcessKurtosis[2], 3)
  ),
  Model_3 = c(
    round(rss_comparison$RSS[3], 2),
    round(ll_comparison$LogLikelihood[3], 2),
    round(aic_bic_comparison$AIC[3], 2),
    round(aic_bic_comparison$BIC[3], 2),
    round(metrics_summary$R2[3], 2),
    round(metrics_summary$Adj_R2[3], 2),
    round(metrics_summary$RMSE[3], 2),
    round(metrics_summary$MPE[3], 2), 
    round(residual_stats$Mean[3], 2),
    round(residual_stats$Median[3], 2),
    round(residual_stats$SD[3], 2),
    round(residual_stats$Skewness[3], 3),
    round(residual_stats$Kurtosis[3], 3),
    round(residual_stats$Min[3], 2),
    round(residual_stats$Max[3], 2),
    round(residual_stats$Range[3], 2),
    normality_measures$Within1SD[3],
    normality_measures$Within2SD[3],
    normality_measures$Within3SD[3],
    round(normality_measures$ExcessKurtosis[3], 3)
  ),
  Model_4 = c(
    round(rss_comparison$RSS[4], 2),
    round(ll_comparison$LogLikelihood[4], 2),
    round(aic_bic_comparison$AIC[4], 2),
    round(aic_bic_comparison$BIC[4], 2),
    round(metrics_summary$R2[4], 2),
    round(metrics_summary$Adj_R2[4], 2),
    round(metrics_summary$RMSE[4], 2),
    round(metrics_summary$MPE[4], 2), 
    round(residual_stats$Mean[4], 6),
    round(residual_stats$Median[4], 2),
    round(residual_stats$SD[4], 2),
    round(residual_stats$Skewness[4], 3),
    round(residual_stats$Kurtosis[4], 3),
    round(residual_stats$Min[4], 2),
    round(residual_stats$Max[4], 2),
    round(residual_stats$Range[4], 2),
    normality_measures$Within1SD[4],
    normality_measures$Within2SD[4],
    normality_measures$Within3SD[4],
    round(normality_measures$ExcessKurtosis[4], 3)
  ),
  Model_5 = c(
    round(rss_comparison$RSS[5], 2),
    round(ll_comparison$LogLikelihood[5], 2),
    round(aic_bic_comparison$AIC[5], 2),
    round(aic_bic_comparison$BIC[5], 2),
    round(metrics_summary$R2[5], 2),
    round(metrics_summary$Adj_R2[5], 2),
    round(metrics_summary$RMSE[5], 2),
    round(metrics_summary$MPE[5], 2), 
    round(residual_stats$Mean[5], 6),
    round(residual_stats$Median[5], 2),
    round(residual_stats$SD[5], 2),
    round(residual_stats$Skewness[5], 3),
    round(residual_stats$Kurtosis[5], 3),
    round(residual_stats$Min[5], 2),
    round(residual_stats$Max[5], 2),
    round(residual_stats$Range[5], 2),
    normality_measures$Within1SD[5],
    normality_measures$Within2SD[5],
    normality_measures$Within3SD[5],
    round(normality_measures$ExcessKurtosis[5], 3)
  ),
  row.names = c(
    "RSS", 
    "Log-Likelihood", 
    "AIC", 
    "BIC", 
    "R2",
    "Adj_R2",
    "RMSE",
    "MPE",
    "Residual_Mean", 
    "Residual_Median", 
    "Residual_SD", 
    "Residual_Skewness", 
    "Residual_Kurtosis", 
    "Residual_Min", 
    "Residual_Max", 
    "Residual_Range", 
    "Residuals_Within1SD", 
    "Residuals_Within2SD", 
    "Residuals_Within3SD", 
    "Residuals_ExcessKurtosis"
  )
)

# Print the comparison table
comparison_table
A data.frame: 20 × 5
Model_1Model_2Model_3Model_4Model_5
<chr><chr><chr><chr><chr>
RSS558810.06 516315.86 568957.37 515094.98 198462.46
Log-Likelihood-32913.67 -32536.92 -32999.4 -32525.65 -27982.45
AIC65833.35 65081.85 66004.79 65059.29 55972.91
BIC65854.83 65110.5 66026.28 65087.94 56001.55
R20.8 0.81 0.79 0.81 0.93
Adj_R20.8 0.81 0.79 0.81 0.93
RMSE7.66 7.36 7.73 7.35 4.56
MPE-0.03 -0.03 -0.03 -0.03 -0.01
Residual_Mean0 0 0 0 0
Residual_Median-0.29 -0.26 -0.05 -0.25 -0.02
Residual_SD7.66 7.36 7.73 7.35 4.56
Residual_Skewness0.142 0.184 -0.049 0.186 -0.479
Residual_Kurtosis3.784 3.823 3.801 3.825 6.549
Residual_Min-37.78 -37.22 -37.78 -37.51 -46.64
Residual_Max34.92 32.07 35.73 31.94 16.41
Residual_Range72.7 69.28 73.51 69.44 63.05
Residuals_Within1SD70.81% (Exp: 68.27%)70.69% (Exp: 68.27%)70.83% (Exp: 68.27%)70.65% (Exp: 68.27%)69.00% (Exp: 68.27%)
Residuals_Within2SD94.91% (Exp: 95.45%)95.17% (Exp: 95.45%)94.77% (Exp: 95.45%)95.14% (Exp: 95.45%)96.56% (Exp: 95.45%)
Residuals_Within3SD99.34% (Exp: 99.73%)99.28% (Exp: 99.73%)99.38% (Exp: 99.73%)99.24% (Exp: 99.73%)99.55% (Exp: 99.73%)
Residuals_ExcessKurtosis0.783 0.823 0.801 0.825 3.549

To make this Table even more visually appealing, we can do formatting using kable from the knitr:

In [109]:
# Create a formatted version with kable for better display:
library(knitr)
kable(comparison_table, 
      caption = "Models Comparison Table Based on all Calculated Metrics:",
      align = c('c', 'c', 'c', 'c', 'c', 'c')) # c for center

# Add model formulas for reference
model_formulas <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"),
  Formula = c(
    "y = θ₁x₄ + θ₂x₃² + θ_bias",
    "y = θ₁x₄ + θ₂x₃² + θ₃x₅ + θ_bias",
    "y = θ₁x₃ + θ₂x₄ + θ₃x₅³",
    "y = θ₁x₄ + θ₂x₃² + θ₃x₅³ + θ_bias",
    "y = θ₁x₄ + θ₂x₁² + θ₃x₃² + θ_bias"
  )
)
# Print model formulas
kable(model_formulas, caption = "Formulas for each Model:")

Table: Models Comparison Table Based on all Calculated Metrics:

|                         |       Model_1        |       Model_2        |       Model_3        |       Model_4        |       Model_5        |
|:------------------------|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
|RSS                      |      558810.06       |      516315.86       |      568957.37       |      515094.98       |      198462.46       |
|Log-Likelihood           |      -32913.67       |      -32536.92       |       -32999.4       |      -32525.65       |      -27982.45       |
|AIC                      |       65833.35       |       65081.85       |       66004.79       |       65059.29       |       55972.91       |
|BIC                      |       65854.83       |       65110.5        |       66026.28       |       65087.94       |       56001.55       |
|R2                       |         0.8          |         0.81         |         0.79         |         0.81         |         0.93         |
|Adj_R2                   |         0.8          |         0.81         |         0.79         |         0.81         |         0.93         |
|RMSE                     |         7.66         |         7.36         |         7.73         |         7.35         |         4.56         |
|MPE                      |        -0.03         |        -0.03         |        -0.03         |        -0.03         |        -0.01         |
|Residual_Mean            |          0           |          0           |          0           |          0           |          0           |
|Residual_Median          |        -0.29         |        -0.26         |        -0.05         |        -0.25         |        -0.02         |
|Residual_SD              |         7.66         |         7.36         |         7.73         |         7.35         |         4.56         |
|Residual_Skewness        |        0.142         |        0.184         |        -0.049        |        0.186         |        -0.479        |
|Residual_Kurtosis        |        3.784         |        3.823         |        3.801         |        3.825         |        6.549         |
|Residual_Min             |        -37.78        |        -37.22        |        -37.78        |        -37.51        |        -46.64        |
|Residual_Max             |        34.92         |        32.07         |        35.73         |        31.94         |        16.41         |
|Residual_Range           |         72.7         |        69.28         |        73.51         |        69.44         |        63.05         |
|Residuals_Within1SD      | 70.81% (Exp: 68.27%) | 70.69% (Exp: 68.27%) | 70.83% (Exp: 68.27%) | 70.65% (Exp: 68.27%) | 69.00% (Exp: 68.27%) |
|Residuals_Within2SD      | 94.91% (Exp: 95.45%) | 95.17% (Exp: 95.45%) | 94.77% (Exp: 95.45%) | 95.14% (Exp: 95.45%) | 96.56% (Exp: 95.45%) |
|Residuals_Within3SD      | 99.34% (Exp: 99.73%) | 99.28% (Exp: 99.73%) | 99.38% (Exp: 99.73%) | 99.24% (Exp: 99.73%) | 99.55% (Exp: 99.73%) |
|Residuals_ExcessKurtosis |        0.783         |        0.823         |        0.801         |        0.825         |        3.549         |

Table: Formulas for each Model:

|Model   |Formula                           |
|:-------|:---------------------------------|
|Model 1 |y = θ₁x₄ + θ₂x₃² + θ_bias         |
|Model 2 |y = θ₁x₄ + θ₂x₃² + θ₃x₅ + θ_bias  |
|Model 3 |y = θ₁x₃ + θ₂x₄ + θ₃x₅³           |
|Model 4 |y = θ₁x₄ + θ₂x₃² + θ₃x₅³ + θ_bias |
|Model 5 |y = θ₁x₄ + θ₂x₁² + θ₃x₃² + θ_bias |

Model 5 is choosen as the better performing model based on the the metrics and statistics from the comparision table above.

  • It has low RSS (198,462.46), much lower then others.

  • It has higest Log-Likelihood value (-27,982.45); best among all the models we have.

  • It has significantly lowest AIC and BIC values (AIC = 55,972.91, BIC = 56,001.55).

  • It has higest R² and Adjusted R² (0.93); Explains 93% of variance.

  • It has lowset RMSE and Residual Standard Deviation (4.56); Indicating low average prediction errors.

  • It has lowest Mean Percentage Error (-0.01); very close to zero.

  • Residuals Analysis:
    • Residual Median: Closest to zero (−0.02) → Least biased predictions.
    • Residual Range: Smallest (63.05) → Less extreme errors.
    • Residuals within 2SD & 3SD: Highest proportions (96.56%, 99.55%) → More residuals concentrated near the mean.
    • Residual Skewness: -0.479 → Slightly left-skewed but still acceptable.
    • Residual Kurtosis: 6.549 → High → Indicates heavier tails, but not fatal as other stats are strong.

Task 2.7: Train Test Split and Confidence Intervals:¶

First, I Splite the dataset into 70% training and 30% testing sets. Training set will be used to estimate the model parameters (for selected Model-2) (and evaluate the model on the basis of trining data). Then, the testing set will be used to evaluate the model's performance on unseen data. At last we will compare the performance of the model based on training and testing data. The model should perform similarly on both sets if it generalizes well.

In [110]:
# Set seed for reproducibility: 
set.seed(0)

# Split the data into training (70%) and testing (30%) sets: 
train_indices <- sample(1:nrow(data), size = round(0.7 * nrow(data)))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]

train_data is the training dataset and test_data is the testing dataset for our further analysis. Each dataset contains respective predictor variables $x1$, $x3$, $x4$, $x5$ and the target variable $y$ ($x2$).

In [111]:
# check first few rows of training data: 
head(train_data)
A data.frame: 6 × 5
x1x3x4x5y
<dbl><dbl><dbl><dbl><dbl>
10172.6775913.7305016.9261654.391977454.14
80042.6397713.8084396.9307684.412677464.79
47753.4397774.2536256.9141154.038479438.46
84621.0296203.6798396.9187054.418359482.66
40502.2438963.7771196.9248884.519612478.48
87893.4372084.2414716.9185373.815512429.74

Here, x1, x3, x4, and x5 are the independent variables, and y is the dependent variable.

In [112]:
# check first few rows of testining data: 
head(test_data)
A data.frame: 6 × 5
x1x3x4x5y
<dbl><dbl><dbl><dbl><dbl>
12.1210633.7079476.9185374.499921480.48
101.9125013.6423126.9253994.396053484.31
113.3421554.2293126.9140954.247066435.29
133.3775884.2490676.9186464.114637426.25
172.8981193.7741396.9205043.852273461.71
203.1587013.7160086.9098033.638901448.56
In [113]:
# Print the dimensions of the training and testing sets:

cat("Training set size:", nrow(train_data), "observations\n")
cat("Testing set size:", nrow(test_data), "observations\n")
Training set size: 6669 observations
Testing set size: 2858 observations
Testing set size: 2858 observations

The sum of size of training data and testing data should equal to the size of the original dataset:

In [114]:
nrow(train_data) + nrow(test_data)
9527

Now, let's create the design matrix for model 5. A function create_model_5_matrix() was defined in task-2, so I make use of it to create design matrix for the selected model:

In [115]:
# Create design matrices for Model-5: 

# Design matrix for training data:
# call the create_model_5_matrix() function to create the design matrix for training data:
# create_model_5_matrix() function is defined above in task 2 
X_train <- create_model_5_matrix(train_data)
y_train <- train_data$y # target variable for training set

# Design matrix for testing data:
# call the create_model_5_matrix() [I defined it earlier] function to create the design matrix for testing data:
X_test <- create_model_5_matrix(test_data)
y_test <- test_data$y # target variable for testing set

2.7.1: Model Evaluation on Training Data:¶

Now, let's create a residual plot to check for patterns in the test residuals. Here I first calculate the predictions made by the model on the test set and then compute the residuals by subtracting the predicted values from the actual values.

In [116]:
# Estimate model parameters using the training dataset:
# Here, use the estimate_parameters() [I defined it earlier] function to estimate the parameters:
theta_train <- estimate_parameters(X_train, y_train)
cat("\nModel 5 parameters estimated from training data:\n")
theta_train # Optimal parameters for Model 5 based on training data
Model 5 parameters estimated from training data:
A matrix: 4 × 1 of type dbl
bias-773.709226
x4 189.178063
x1_squared -4.740873
x3_squared -2.579386

Here, I am using the optimum parameters estimated from the training data and not from the entire dataset. Parameters that are estimated from the whole dataset may generalize well because they are estimated using a bigger dataset. But, we are asked to use the parameters estimated from the training data and I will stick to that.

In [117]:
# residuals for training data: 
train_residuals <- y_train - X_train %*% theta_train
In [118]:
dim(train_residuals)
  1. 6669
  2. 1

Now, let's visuallize the distribution of residuals using histogram + density curve + rug plot. In regression analysis, the residuals (the differences between the observed and predicted values) are expected to be normally distributed if the model's assumptions are met.

In [119]:
# distribution plot of training residuals: (Histogram + Density Curve + Rug Plot):

# Create a histogram of the training residuals
ggplot(data.frame(Residual = train_residuals), aes(x = Residual)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "#8207ed", color = "white", alpha = 0.7) +
  geom_density(color = "#ff8800", linewidth = 2) +
  # add rug plot:
  geom_rug(sides = "b", color = "black", alpha = 0.5) +
  labs(title = "Distribution of Training Residuals: (Model 5):",
       x = "Residual Value",
       y = "Density") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.4, face = "bold", size = 20),
    plot.subtitle = element_text(hjust = 0.5)
  )
No description has been provided for this image

Due to a few outliers, the distribution appears like a left-skewed but if we omit about 5-6 residuals then it is very close to normal distribution. The residuals form a shape that is roughly symmetrical (balenced on both sides of zero) omitting few extreme residuals on left, has only one peak (unimodal), almost centered around zero. The shape almost resembles the normal distribution. If we remove outliers from the dataset then this may become more normal-like.

Now, we will perform some statistical tests on the train_residuals to better confirm if they are normally distributed or not.

In [120]:
# descriptive statistics of residuals: 

# Calculate skewness and kurtosis
skew <- skewness(train_residuals)
kurt <- kurtosis(train_residuals)

cat("Descriptive Statistics for Residuals:\n")
cat("Mean:", mean(train_residuals^2),"\n")
cat("Median:", median(train_residuals^2), "\n")
cat("Standard Deviation:", sd(train_residuals^2), "\n")
cat("Skewness:", skew, "(0 for normal distribution),\n")
cat("Excess Kurtosis:", kurt-3, "(0 for normal distribution)")
Descriptive Statistics for Residuals:
Mean: 20.90017 
Median: 10.21065 
Mean: 20.90017 
Median: 10.21065 
Standard Deviation: 49.56946 
Skewness: -0.4860478 (0 for normal distribution),
Excess Kurtosis: 3.624235 (0 for normal distribution)
In [121]:
# Calculate proportion of residuals within different SD ranges
std_residuals <- (train_residuals^2 - mean(train_residuals^2)) / sd(train_residuals^2)
within_1sd <- mean(abs(std_residuals) <= 1) * 100
within_2sd <- mean(abs(std_residuals) <= 2) * 100
within_3sd <- mean(abs(std_residuals) <= 3) * 100

cat("\nProportion of Residuals within Standard Deviation Ranges:\n")
cat("Within 1 SD:", within_1sd, "% (Expected: 68.27%)\n")
cat("Within 2 SD:", within_2sd, "% (Expected: 95.45%)\n")
cat("Within 3 SD:", within_3sd, "% (Expected: 99.73%)")
Proportion of Residuals within Standard Deviation Ranges:
Within 1 SD: 95.06673 % (Expected: 68.27%)
Within 2 SD: 98.65047 % (Expected: 95.45%)
Within 3 SD: 99.50517 % (Expected: 99.73%)Within 1 SD: 95.06673 % (Expected: 68.27%)
Within 2 SD: 98.65047 % (Expected: 95.45%)
Within 3 SD: 99.50517 % (Expected: 99.73%)

It is concluded that train_residuals are normally distributed or close to it as the density plot of residuals, related Descriptive Statistics for Residuals and Proportion of Residuals within Standard Deviation Ranges all are in line to support this conclusion.

Actual vs. Predicted values:¶

Now, let's create scatter plots to compare actual vs predicted values for trainnig data. This gives an idea of how well our model is performing. If model is performing well for training data then, the predictions and the actual values should be close to each other as possible.

In [122]:
# Create a scatter plot comparing actual vs predicted values: 

# Calculate predicted values for training data
y_train_pred <- X_train %*% theta_train

# Create a dataframe for plotting
comparison_data <- data.frame(
  Actual = y_train,
  Predicted = as.vector(y_train_pred)
)

# Calculate R-squared for reference
train_r_squared <- 1 - sum((y_train - y_train_pred)^2) / sum((y_train - mean(y_train))^2)

# Create a scatter plot with actual values on x-axis and predicted on y-axis
ggplot(comparison_data, aes(x = Actual, y = Predicted)) +
  # Add points
  geom_point(color = "#3366CC", alpha = 0.6, size = 1.5) +
  
  # Add perfect prediction line (y = x)
  geom_abline(intercept = 0, slope = 1, color = "#CC3366", 
              linewidth = 1, linetype = "dashed") +
  
  # Add a smooth trend line to see the actual relationship
  geom_smooth(method = "lm", color = "#33CC66", se = FALSE, linewidth = 1) +
  
  # Add labels and title
  labs(title = "Model 5: Predicted vs Actual Values (Training Data)",
       subtitle = paste("(Points on diagonal line represent perfect predictions)"),
       x = "Actual Energy Output",
       y = "Predicted Energy Output") +
  
  # Use a clean theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    legend.position = "none"
  )
`geom_smooth()` using formula = 'y ~ x'
No description has been provided for this image

This scatter plot compares model performance by plotting actual values along x-axis and predicted values along y-axis. The red dashed diagonal line ($y = x$) represents perfect predictions, where predicted values match actual values exactly. Points near this line indicate accurate predictions, while those above or below reflect overestimation or underestimation, respectively. A blue trend line is also included to show the general alignment between actual and predicted values, helping to identify any bias or systematic error in the model's output.

Almost all points are clustered around the diagonal line, which means model’s predictions are highly accurate. Very few points are scattered randomly which is normal because it is imposible to have perfect prediction on each and every observation. The reference line (red) and the overall trend line are parallel and almost overlap, further indicating high prediction accuracy.

95% Confidence Intervals for Model Predictions (Test Data):

In [123]:
# Calculate the residual variance from the training data: 
# This estimates the inherent noise/variability in our data
n_train <- nrow(X_train)                # Number of training samples
p <- length(theta_train)               # Number of model parameters
residuals_train <- y_train - X_train %*% theta_train  # Training residuals
rss_train <- sum(residuals_train^2)     # Residual sum of squares
sigma_squared <- rss_train / (n_train - p)  # Estimated error variance
In [124]:
# Calculate the variance-covariance matrix of the parameters
# This matrix captures uncertainty in our parameter estimates
XtX_inv <- solve(t(X_train) %*% X_train)  # (X'X)^-1 matrix
In [125]:
# Calculate prediction variances for each training point
# The prediction variance combines parameter uncertainty with inherent noise
prediction_vars <- numeric(nrow(X_train))
for (i in 1:nrow(X_train)) {
  x_i <- X_train[i, ]  # Feature vector for training point i
  # Formula: Var(ŷ) = σ² * x_i' * (X'X)^-1 * x_i
  prediction_vars[i] <- sigma_squared * (t(x_i) %*% XtX_inv %*% x_i)
}
In [126]:
# Calculate the standard errors of the predictions
prediction_se <- sqrt(prediction_vars)  # Standard error = sqrt(variance)
In [127]:
# Calculate the critical t-value for 95% confidence
# We use t-distribution with (n_train - p) degrees of freedom
t_critical <- qt(0.975, n_train - p)  # 0.975 for 95% confidence (two-tailed)
In [128]:
# Calculate the confidence intervals
# CI = prediction ± t_critical * standard_error
y_train_pred <- X_train %*% theta_train  # Predicted values
lower_ci <- y_train_pred - t_critical * prediction_se  # Lower bound
upper_ci <- y_train_pred + t_critical * prediction_se  # Upper bound
In [129]:
# Create a data frame for plotting
# Sort by actual values for better visualization
ci_data <- data.frame(
  Index = 1:nrow(X_train),            # Index for x-axis
  Actual = y_train,                   # Actual values
  Predicted = y_train_pred,           # Predicted values
  Lower_CI = lower_ci,                # Lower confidence bound
  Upper_CI = upper_ci,                # Upper confidence bound
  CI_Width = upper_ci - lower_ci      # Width of confidence interval
)
In [130]:
# Sort data by actual values for better visualization
ci_data <- ci_data[order(ci_data$Actual), ]

# Print confidence interval statistics for reference
cat("Confidence interval statistics for training data:\n")
cat("Mean width:", mean(ci_data$CI_Width), "\n")
cat("Maximum width:", max(ci_data$CI_Width), "\n")
cat("Minimum width:", min(ci_data$CI_Width), "\n")
cat("Range of y values:", min(y_train), "to", max(y_train))
Confidence interval statistics for training data:
Mean width: 0.4227833 
Maximum width: 1.395781 
Minimum width: 0.2196116 
Mean width: 0.4227833 
Maximum width: 1.395781 
Minimum width: 0.2196116 
Range of y values: 420.26 to 495.24
In [131]:
head(ci_data) # check top rows of ci_data dataframe: 
A data.frame: 6 × 6
IndexActualPredictedLower_CIUpper_CICI_Width
<int><dbl><dbl><dbl><dbl><dbl>
43864386420.26443.8340443.6209444.04710.4262423
16471647425.16431.8764431.6870432.06580.3788813
13871387425.17430.5321430.2981430.76610.4680361
11541154425.19430.0867429.8253430.34820.5229447
52875287425.20433.2140433.0047433.42330.4186112
58625862425.21432.2304432.0274432.43340.4060607
In [132]:
# Create the plot for predictions, confidence intervals, and actual values: 

prediction_plot <- ggplot(ci_data, aes(x = Index)) +
  # Add confidence interval ribbon
  # This shows the range where we expect the true value to fall with 95% probability
  geom_ribbon(aes(ymin = Lower_CI, ymax = Upper_CI, fill = "95% Confidence Interval"), 
              alpha = 0.5) +
  
  # Add predicted values as a line
  # This shows our model's best estimate for each training point
  geom_line(aes(y = Predicted, color = "Predicted Values"), 
            linewidth = 1.2) +
  
  # Add actual values as points
  # These show the true observed values we're trying to predict
  geom_point(aes(y = Actual, color = "Actual Values"), 
             size = 0.8, alpha = 0.7) +
  
  # Set custom colors for better visibility: 
  scale_color_manual(values = c("Predicted Values" = "#28ca76",  
                                "Actual Values" = "#f40f0f"), 
                     name = "") +
  scale_fill_manual(values = c("95% Confidence Interval" = "#ffc800"),  # Yellow
                    name = "") +
  
  # Add informative labels and title
  labs(title = "Model 5 Predictions with 95% Confidence Intervals (Training Data):",
       subtitle = "(Data points sorted by actual value)",
       x = "Training Data Index (sorted by actual value)",
       y = "Energy Output (y)") +
  
  # Add theme with legend at the bottom
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size= 20),
    plot.subtitle = element_text(hjust = 0.5, size= 15),
    legend.position = "bottom",
    legend.box = "horizontal",
    legend.margin = margin(t = 10)
  ) +
  
  # Combine the legends for better organization
  guides(color = guide_legend(order = 1),
         fill = guide_legend(order = 2))

# Display the plot
prediction_plot
No description has been provided for this image

It seems like the confidence intervals are too narrow to be visible on the plot. To address this issue, we can artificially expand the confidence intervals for visualization purposes only. This will help us see the intervals better without changing the actual statistical values. And this is not the best statistical solution to the problem but what I want to prove is confidence intervals do exist in the plot above.

In [133]:
# confidence intervals are too narrow to see, so create an expanded version
# by artificially widening the intervals by a factor 20 for better visualization
expansion_factor <- 20

expanded_ci_data <- ci_data
expanded_ci_data$Lower_CI_Expanded <- ci_data$Predicted - 
  (ci_data$Predicted - ci_data$Lower_CI) * expansion_factor
expanded_ci_data$Upper_CI_Expanded <- expanded_ci_data$Predicted + 
  (ci_data$Upper_CI - ci_data$Predicted) * expansion_factor

expanded_plot <- ggplot(expanded_ci_data, aes(x = Index)) +
  # Add expanded confidence interval ribbon
  geom_ribbon(aes(ymin = Lower_CI_Expanded, ymax = Upper_CI_Expanded, 
                 fill = "95% Confidence Interval (expanded for visibility)"), 
              alpha = 0.5) +
  
  # Add predicted values line
  geom_line(aes(y = Predicted, color = "Predicted Values"), 
            linewidth = 1) +
  
  # Add actual values points
  geom_point(aes(y = Actual, color = "Actual Values"), 
             size = 0.8, alpha = 0.7) +
  
  # Set custom colors
  scale_color_manual(values = c("Predicted Values" = "#33e78a", 
                                "Actual Values" = "black"),
                     name = "") +
  scale_fill_manual(values = c("95% Confidence Interval (expanded for visibility)" = "#f80000"),
                    name = "") +
  
  # Add labels and mention that it is expended version of plot above and use for understanding purpose only
  labs(title = "Model 5 Predictions with Expanded Confidence Intervals (Training Data):",
       subtitle = paste("(Note: CI width artificially expanded by", expansion_factor, "times for visibility)"),
       x = "Training Data Index (sorted by actual value)",
       y = "Energy Output (y)") +
  
  # Add theme
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 19),
    plot.subtitle = element_text(hjust = 0.5, size= 15),
    legend.position = "bottom",
    legend.box = "horizontal"
  )

# Display the expanded plot
print(expanded_plot)
No description has been provided for this image

Now, after expending the confidence intervals by a factor of 20, we can see the confidence intervals better. Expending the confidence intervals is not a statistically sound approach but here it is used only to better visualize the confidence intervals on the plot.

And, to give a better show of confidence intervals, create a zoom-in version of the plot by taking first 100 data points.

In [134]:
# Create a zoom-in version of the plot to better show the confidence intervals
# Select a small subset of the data (first 100 points) 

zoom_data <- ci_data[1:100, ]

zoom_plot <- ggplot(zoom_data, aes(x = Index)) +
  # Add confidence interval ribbon
  geom_ribbon(aes(ymin = Lower_CI, ymax = Upper_CI, fill = "95% Confidence Interval"), 
              alpha = 0.9) +
              
  # Add predicted values line
  geom_line(aes(y = Predicted, color = "Predicted Values"), 
            linewidth = 1.2) +
            
  # Add actual values points
  geom_point(aes(y = Actual, color = "Actual Values"), 
             size = 1.5, alpha = 0.7) +
             
  # Set custom colors
  scale_color_manual(values = c("Predicted Values" = "blue", 
                                "Actual Values" = "black"),
                     name = "") +
  scale_fill_manual(values = c("95% Confidence Interval" = "#f2220f"),
                    name = "") +
                    
  # Add labels
  labs(title = "Zoomed View: Model 5 Predictions with 95% Confidence Intervals (Training Dataset):",
       subtitle = "(First 100 data points)",
       x = "Training Data Index",
       y = "Energy Output (y)") +
       
  # Add theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size= 19),
    plot.subtitle = element_text(hjust = 0.5, size= 15),
    legend.position = "bottom"
  )

# Display the zoomed plot
print(zoom_plot)
No description has been provided for this image

This is the zoomed-in view of Model 2's predictions (training dataset) for the first 100 training data points with their 95% confidence intervals. The blue line represents the predicted energy output, while the red ribbon illustrates the confidence intervals, showing the range within which the true values are expected to fall with 95% certainty. The black dots indicate the actual observed values. For the first 100 samples, the model is overpredicting the enorgy output because most of the black dots are below the blue line. But this is a very small portion of the training dataset and this type of overprediction can be ignored.

And, even after magnifying the confidence intervals they are still narrow which reflect very low uncertainty in our predictions. This further validates that the model is performing well on the traning data.

Confidence Intervals and Error Bars for Training Data:

In [135]:
# Calculate the residual variance from the training data
n_train <- nrow(X_train)                # Number of training samples
p <- length(theta_train)               # Number of model parameters (including intercept)
residuals_train <- y_train - X_train %*% theta_train  # Calculate residuals
rss_train <- sum(residuals_train^2)     # Residual sum of squares
sigma_squared <- rss_train / (n_train - p)  # Estimated error variance

# Calculate the variance-covariance matrix of the parameters
# This captures uncertainty in our parameter estimates
XtX_inv <- solve(t(X_train) %*% X_train)  # (X'X)^-1 matrix

# Calculate prediction variances for each training point
# For each prediction, we need to calculate its variance
prediction_vars <- numeric(nrow(X_train))
for (i in 1:nrow(X_train)) {
  x_i <- X_train[i, ]  # Feature vector for training point i
  # Formula: Var(ŷ) = σ² * x_i' * (X'X)^-1 * x_i
  prediction_vars[i] <- sigma_squared * (t(x_i) %*% XtX_inv %*% x_i)
}

# Calculate the standard errors of the predictions
prediction_se <- sqrt(prediction_vars)  # Standard error = sqrt(variance)

# Calculate the critical t-value for 95% confidence
# We use t-distribution with (n_train - p) degrees of freedom
t_critical <- qt(0.975, n_train - p)  # 0.975 for 95% confidence (two-tailed)

# Calculate the confidence intervals
y_train_pred <- X_train %*% theta_train  # Predicted values
margin_of_error <- t_critical * prediction_se  # Margin of error for 95% CI
lower_ci <- y_train_pred - margin_of_error  # Lower bound
upper_ci <- y_train_pred + margin_of_error  # Upper bound

# Create a data frame for plotting
# For clarity and computing efficiency, use a subset of the data (e.g., every 50th point)
# Using all points would make the error bars too crowded and also, takes over a minute to create
# a single plot: 

sample_interval <- 50
sample_indices <- seq(1, nrow(X_train), by = sample_interval)

plot_data <- data.frame(
  Index = sample_indices,
  Actual = y_train[sample_indices],
  Predicted = y_train_pred[sample_indices],
  Lower_CI = lower_ci[sample_indices],
  Upper_CI = upper_ci[sample_indices],
  SE = prediction_se[sample_indices]
)

# Create a full data frame for ribbon plotting (all points)
full_data <- data.frame(
  Index = 1:nrow(X_train),
  Actual = y_train,
  Predicted = y_train_pred,
  Lower_CI = lower_ci,
  Upper_CI = upper_ci
)

# Print confidence interval statistics for reference
cat("Confidence interval statistics for training data:\n")
cat("Mean margin of error:", mean(margin_of_error), "\n")
cat("Maximum margin of error:", max(margin_of_error), "\n")
cat("Minimum margin of error:", min(margin_of_error), "\n")
Confidence interval statistics for training data:
Mean margin of error: 0.2113917 
Maximum margin of error: 0.6978904 
Minimum margin of error: 0.1098058 
Mean margin of error: 0.2113917 
Maximum margin of error: 0.6978904 
Minimum margin of error: 0.1098058 
In [136]:
# Create plot with:
# - Confidence interval ribbon (for all points)
# - Error bars (for sampled points)
# - Predicted values line
# - Actual values points

combined_plot <- ggplot() +
  # Add confidence interval ribbon for all points
  geom_ribbon(data = full_data, 
              aes(x = Index, ymin = Lower_CI, ymax = Upper_CI, 
                  fill = "95% Confidence Interval (ribbon)"), 
              alpha = 0.3) +
  
  # Add predicted values line for all points
  geom_line(data = full_data, 
            aes(x = Index, y = Predicted, color = "Predicted Values"), 
            linewidth = 0.8) +
  
  # Add actual values points for all points: 
  geom_point(data = full_data, 
             aes(x = Index, y = Actual, color = "Actual Values"), 
             size = 0.7, alpha = 0.5) +
  
  # Add error bars for sampled points
  geom_errorbar(data = plot_data, 
                aes(x = Index, ymin = Lower_CI, ymax = Upper_CI, 
                    color = "95% Confidence Interval (error bars)"),
                width = 10, linewidth = 0.7, alpha = 0.7) +
  
  # Add predicted values points for sampled points
  geom_point(data = plot_data, 
             aes(x = Index, y = Predicted, shape = "Predicted (sampled)"), 
             color = "#0f1110", size = 2) +
  
  # Set custom colors and shapes
  scale_color_manual(values = c("Predicted Values" = "#33e78a",
                                "Actual Values" = "#ff0000",
                                "95% Confidence Interval (error bars)" = "#2323f4"),
                     name = "Data Elements") +
  scale_fill_manual(values = c("95% Confidence Interval (ribbon)" = "#ffc800"),
                    name = "Confidence Region") +
  scale_shape_manual(values = c("Predicted (sampled)" = 19),
                     name = "Sampled Points") +
  
  # Add informative labels and title
  labs(title = "Model 5: Error Bars with 95% Confidence Intervals(Training Data):",
       subtitle = "(Showing CI (all points) and error bars (sampled points))\n
       (one error bar for every 50th data point:)",
       x = "Training Data Index",
       y = "Energy Output (y)") +
  
  # Add theme with legend at the bottom
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 19),
    plot.subtitle = element_text(hjust = 0.5, size = 15),
    legend.position = "bottom",
    legend.box = "vertical"
  )

# Display the combined plot
print(combined_plot)
No description has been provided for this image

This plot shows the model's predictions, actual values, error bars and confidence intervals for the training data. The green ribbon represents the 95% confidence interval (not visible clearly) for the predicted values, and shows the range within which the true energy output is expected to fall with high confidence. The red dots indicate the actual observed values, while the green line represents the predicted values from the model. Also, blue error bars (not visible clearly) are also shown for every 50th data point to highlight the confidence intervals more clearly. The predicted values that are from our sample data points (one sample data from every 50th data point) are marked with black dots.

One problem with this plot is that the confidence intervals and the error bars are too narrow to see clearly. This is happening because the confidence intervals are based on the standard error of the predictions, which is very small for our case. This is actually a good sign that model is making less errors but this results in visilualizing the interval and the error bars difficult.To show that they also exist on the plot, I will artificially expand the confidence intervals by a certain factor. Again, this is not a statistically sound approach but I use this only to show that the confidence intervals and the error bars exist and are very small.

In [137]:
# Create error bar plot

errorbar_plot <- ggplot(plot_data, aes(x = Index)) +
  # Add predicted values as points
  geom_point(aes(y = Predicted, color = "Predicted Values"), 
             size = 2) +
  
  # Add error bars for 95% confidence intervals
  geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI, color = "95% Confidence Interval"),
                width = 10, linewidth = 0.5, alpha = 0.7) +
  
  # Add actual values as different points
  geom_point(aes(y = Actual, color = "Actual Values"), 
             size = 2, shape = 17) +  # Using triangle shape for actual values
  
  # Set custom colors
  scale_color_manual(values = c("Predicted Values" = "#33e78a",
                                "Actual Values" = "#e73333",
                                "95% Confidence Interval" = "#0b0bf1"),
                     name = "") +
  
  # Add informative labels and title
  labs(title = "Model 5: Predictions with 95% Confidence Intervals (Error Bars)(Training Data):",
       subtitle = "(Error bars show 95% confidence intervals for predictions) \n
       (one error bar for every 50th data point)",
       x = "Training Data Index (sampled every 50 points)",
       y = "Energy Output (y)") +
  
  # Add theme with legend at the bottom
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size= 19),
    plot.subtitle = element_text(hjust = 0.5, size= 14),
    legend.position = "bottom",
    legend.box = "horizontal"
  )

# Display the error bar plot
print(errorbar_plot)
No description has been provided for this image

This illustrates the predictions made by Model 2 on the training data, along with their 95% confidence intervals. The blue error bars represent the confidence intervals for the predicted values, providing a range within which the true energy output is expected to fall with 95% confidence. The green points indicate the predicted values, while the red triangles represent the actual observed energy output values.

The error bars are shown for every 50th data point to avoid overcrowding, and they demonstrate the uncertainty associated with each prediction. The alignment of most red triangles (actual values) with the green points (predicted values) and within the blue error bars suggests that the model is performing well, with accurate predictions and reasonable confidence intervals.

In [138]:
# Create a zoomed-in version with more dense sampling for better visualization
# Select a small range of data points (e.g., points 100-200)
zoom_range <- 100:200
zoom_data <- data.frame(
  Index = zoom_range,
  Actual = y_train[zoom_range],
  Predicted = y_train_pred[zoom_range],
  Lower_CI = lower_ci[zoom_range],
  Upper_CI = upper_ci[zoom_range]
)

zoom_errorbar_plot <- ggplot(zoom_data, aes(x = Index)) +
  # Add predicted values as points
  geom_point(aes(y = Predicted, color = "Predicted Values"), 
             size = 2) +
  
  # Add error bars for 95% confidence intervals
  geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI, color = "95% Confidence Interval"),
                width = 0.5, linewidth = 0.5, alpha = 0.7) +
  
  # Add actual values as different points
  geom_point(aes(y = Actual, color = "Actual Values"), 
             size = 2, shape = 17) +
  
  # Connect predicted points with a line for clarity
  geom_line(aes(y = Predicted, color = "Predicted Values"), 
            linewidth = 0.5, alpha = 0.5) +
  
  # Set custom colors
  scale_color_manual(values = c("Predicted Values" = "#33e78a",
                                "Actual Values" = "#e73333",
                                "95% Confidence Interval" = "#3333e7"),
                     name = "") +
  
  # Add informative labels and title
  labs(title = "Zoomed View: Model 5 Predictions with 95% Confidence Intervals",
       subtitle = "(Training data points 100-200)",
       x = "Training Data Index",
       y = "Energy Output (y)") +
  
  # Add theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size= 19),
    plot.subtitle = element_text(hjust = 0.5, size= 15),
    legend.position = "bottom"
  )
  
# Display the zoomed plot
print(zoom_errorbar_plot)
No description has been provided for this image

This is a zoomed-in view of Model 5's predictions for training data points 100–200, along with their 95% confidence intervals. The blue error bars represent the confidence intervals, showing the range within which the true energy output is expected to fall with 95% confidence. The green line connects the predicted values, while the red triangles indicate the actual observed energy output values.

The zoomed view highlights the alignment between the predicted values and the actual values. Most red triangles (actual values) fall close to the green line (predicted values) and within the blue error bars, indicating that the model is making accurate predictions with reasonable confidence intervals.

In [139]:
# Create a plot with expanded error bars for better visibility
# This is useful if the confidence intervals are very narrow
expansion_factor <- 10  # Adjust as needed

expanded_data <- plot_data
expanded_data$Lower_CI_Expanded <- expanded_data$Predicted - 
  (expanded_data$Predicted - expanded_data$Lower_CI) * expansion_factor
expanded_data$Upper_CI_Expanded <- expanded_data$Predicted + 
  (expanded_data$Upper_CI - expanded_data$Predicted) * expansion_factor

expanded_errorbar_plot <- ggplot(expanded_data, aes(x = Index)) +
  # Add predicted values as points
  geom_point(aes(y = Predicted, color = "Predicted Values"), 
             size = 2) +
  
  # Add expanded error bars
  geom_errorbar(aes(ymin = Lower_CI_Expanded, ymax = Upper_CI_Expanded, 
                    color = "95% Confidence Interval (expanded)"),
                width = 10, linewidth = 0.5, alpha = 0.7) +
  
  # Add actual values as different points
  geom_point(aes(y = Actual, color = "Actual Values"), 
             size = 2, shape = 17) +
  
  # Set custom colors
  scale_color_manual(values = c("Predicted Values" = "#33e78a",
                                "Actual Values" = "#e73333",
                                "95% Confidence Interval (expanded)" = "#3333e7"),
                     name = "") +
  
  # Add informative labels and title
  labs(title = "Model 5 Predictions with Expanded 95% Confidence Intervals",
       subtitle = paste("Error bars expanded by factor of", expansion_factor, "for visibility"),
       x = "Training Data Index (sampled every 50 points)",
       y = "Energy Output (y)") +
  
  # Add theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 18),
    plot.subtitle = element_text(hjust = 0.5, size= 15),
    legend.position = "bottom"
  )

# Display the expanded plot
print(expanded_errorbar_plot)
No description has been provided for this image

This shows model 2's predictions on the training data, along with expanded 95% confidence intervals for better visibility. The blue error bars represent the confidence intervals, which have been artificially expanded by a factor of 10 to make them more noticeable. The green points indicate the predicted values, while the red triangles represent the actual observed energy output values.

Most of the red triangles (actual values) are close to the green points (predicted values), indicating that the model is making accurate predictions. The blue error bars show the range within which the true energy output is expected to fall with 95% confidence. The expanded intervals are for visualization purposes only and do not reflect the actual statistical confidence interval widths. Some red triangles fall outside the blue error bars, indicating instances where the model's predictions deviate significantly from the actual values. These could be outliers or areas where the model struggles to generalize.

2.7.2: Model Evaluation on Testing Data:¶

Now, apply our trained model to make predictions on the test data and evaluate the model on test dataset. If model is performs well in the test dataset too then, we conclude that model in general is generalizing well.

In [140]:
# Compute model's predictions on the testing data:
# Here, use the matrix multiplication to get the predicted values:
# y_pred = X_test * theta_train 
# where X_test is the design matrix for the testing data and 
# theta_train are the estimated parameters from the training data.
y_pred <- X_test %*% theta_train
In [141]:
head(y_pred) # See first few predicted values
A matrix: 6 × 1 of type dbl
478.3338
484.8645
435.1927
434.4927
458.9381
450.5544
In [142]:
head(y_test) # first few actual values from out test dataset
  1. 480.48
  2. 484.31
  3. 435.29
  4. 426.25
  5. 461.71
  6. 448.56

Now, lets plot the distribution of y_pred and y_test on the testing dataset.

In [143]:
# Create a data frame with both actual and predicted values: 

comparison_df <- data.frame(
  Value = c(y_test, y_pred),
  Type = factor(c(rep("Actual", length(y_test)), rep("Predicted", length(y_pred))))
)
dim(comparison_df)
  1. 5716
  2. 2
In [144]:
# Create a data frame with both actual and predicted values

comparison_df <- data.frame(
  Value = c(y_test, y_pred),
  Type = factor(c(rep("Actual", length(y_test)), rep("Predicted", length(y_pred))))
)
In [145]:
# Create histograms with density and rug plots

ggplot(comparison_df, aes(x = Value, fill = Type, color = Type)) +
  # Add histogram with density scaling
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 30, 
                 alpha = 0.7) +
  # Add density curves
  geom_density(aes(y = after_stat(density)), 
               alpha = 0.2, 
               linewidth = 1.2) +
  # Add rug plots
  geom_rug(alpha = 0.7) +
  # Set custom colors
  scale_fill_manual(values = c("Actual" = "#385ea9", "Predicted" = "#b03a62")) +
  scale_color_manual(values = c("Actual" = "#655461", "Predicted" = "#17060c")) +
  # Facet into two columns by Type
  facet_grid(. ~ Type) +
  # Add labels and title
  labs(title = "Distribution of Actual vs Predicted Values (Test Data):",
       x = "Energy Output (MW)",
       y = "Density") +
  # Customize theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 20),
    legend.position = "none",
    strip.text = element_text(size = 14, face = "bold"),
    axis.title = element_text(size = 12)
  )
No description has been provided for this image

Left curve shows the histogram (with density curve overlayed over it) for the actual values. The curve and the shape of the distribution is not symmetrical, it possess characterstics of bimodal distribution. The rugplot at the buttom indicates the values are somehow distributed widely along x-axis.

The distribution of predicted valaues for the testing dataset almost resembles with left curve (curve of actual values). The shap is not normal-like, this curve also possess the properties of bimodal distribution. Rug plot at bottom shows the predictions are more centered around 430-490range. Both the curves are similar in shape and appearance indicating model is making highly accurate predictions and because of this high accuracy curves are almost identical.

Now, calculate the residuals for the testing dataset and plot them to check the normality of the residuals and their overall distribution pattern.

In [146]:
# residuals for training data:
test_residuals <- y_test - y_pred
head(test_residuals)
A matrix: 6 × 1 of type dbl
2.14620787
-0.55451952
0.09734623
-8.24273486
2.77190434
-1.99438883
In [147]:
# distribution plot of training residuals: (Histogram + Density Curve + Rug Plot):

# Create a histogram of the test residuals
ggplot(data.frame(Residual = test_residuals), aes(x = Residual)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "#ebba28", color = "white", alpha = 0.7) +
  geom_density(color = "#ff8800", linewidth = 1.5) +
  # add rug plot:
  geom_rug(sides = "b", color = "#f50b0b", alpha = 0.5) +
  labs(title = "Distribution of Test Residuals: Model 5",
       x = "Residual Value",
       y = "Density") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.4, face = "bold", size = 20),
    plot.subtitle = element_text(hjust = 0.5)
  ) 
No description has been provided for this image

The plot shows the distribution of residuals calculated by taking difference between actual and predicted values for test dataset. Yellow bars represents the histogram of the residuals whereas the orange line over the histogram is the density plot for the residuals. The red ticks at the bottom are rug plots which shows the individual residuals.

Residuals are distributed with center around zero which indicates the model is unbiased. The distribution is almost symmetrical around zero with approximate bell-shape; excluding those five extreme residual values at the left tail. The distribution of residuals is acceptably normal which is the assumption in regression analysis. Model is performing well in the testing dataset too.

One more test for Normality of Residuals. Calculate what percentage of residuals fall within 1, 2, and 3 standard deviations from the mean. Then I compare these values to the expected percentages for a normal distribution (68.27%, 95.45%, and 99.73%, respectively).

In [148]:
# Calculate residuals for test data
test_residuals <- y_test - y_pred

# Calculate basic statistics
mean_val <- mean(test_residuals)
median_val <- median(test_residuals)
sd_val <- sd(test_residuals)

# Calculate standardized residuals
std_residuals <- (test_residuals - mean_val) / sd_val

# Calculate proportion within different SD ranges
within_1sd <- mean(abs(std_residuals) <= 1) * 100
within_2sd <- mean(abs(std_residuals) <= 2) * 100
within_3sd <- mean(abs(std_residuals) <= 3) * 100

# Expected proportions for normal distribution
expected_1sd <- 68.27
expected_2sd <- 95.45
expected_3sd <- 99.73

# Create results data frame
normality_measures <- data.frame(
  Range = c("Within 1 SD", "Within 2 SD", "Within 3 SD"),
  Actual = c(within_1sd, within_2sd, within_3sd),
  Expected = c(expected_1sd, expected_2sd, expected_3sd)
)

# Print results
cat("Normality Distribution Measures for Test Residuals:\n")
print(normality_measures)
Normality Distribution Measures for Test Residuals:
        Range   Actual Expected
1 Within 1 SD 68.92932    68.27
2 Within 2 SD 96.60602    95.45
3 Within 3 SD 99.54514    99.73
        Range   Actual Expected
1 Within 1 SD 68.92932    68.27
2 Within 2 SD 96.60602    95.45
3 Within 3 SD 99.54514    99.73

The distribution of residuals for the test data closely follows a normal distribution. Approximately 67.49% of the residuals fall within 1 standard deviation of the mean, which is very close to the expected 68.27% for a normal distribution. Similarly, about 96.47% of the residuals fall within 2 standard deviations, slightly exceeding the expected 95.45%. Finally, 99.55% of the residuals fall within 3 standard deviations, which is nearly identical to the expected 99.73%. These results indicate that the residuals are approximately normally distributed.

Scatter plot to compare actual vs predicted values (testing data):

In [149]:
# Create a data frame for plotting (test data): 
comparison_data <- data.frame(
  Actual = y_test,
  Predicted = as.vector(y_pred)
)

# Calculate R-squared for test data
test_r_squared <- 1 - sum((y_test - y_pred)^2) / sum((y_test - mean(y_test))^2)

# Create a scatter plot with actual values on x-axis and predicted on y-axis
ggplot(comparison_data, aes(x = Actual, y = Predicted)) +
  # Add points
  geom_point(color = "#c42121", alpha = 0.6, size = 2.1) +

  # Add perfect prediction line (y = x)
  geom_abline(intercept = 0, slope = 1, color = "#0e1509", 
              linewidth = 1.5, linetype = "dashed", size= 1.5, ) +

  # Add a smooth trend line to see the actual relationship
  geom_smooth(method = "lm", color = "#2578e6", se = FALSE, linewidth = 1.5) +

  # Add labels and title
  labs(title = "Model 5: Predicted vs Actual Values (Test Data)",
       subtitle = paste("Points on diagonal line represent perfect predictions"),
       x = "Actual Energy Output",
       y = "Predicted Energy Output") +

  # Use a clean theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 19),
    plot.subtitle = element_text(hjust = 0.5, size = 15),
    legend.position = "none"
  )
`geom_smooth()` using formula = 'y ~ x'
No description has been provided for this image

On x-axis, plot shows the actual enorgy output values and on the y-axis it shows the predicted values for the testing dataset. A black dashed diagonal line is plotted which is a hypothetical perfect prediction line. The red dots on the plot are the actual values and more closer these values to the diagnol line are, the more accuracy the predictions will be. A blue line is also plotted which shows an overall relation between predictions and the actual values.

The general trend line (blue line) and the reference line (black dashed line) are parallel and almost overlap upon each other. Also, the actual values (red dots) are linearly clustered around the reference line with constant variance. There are very few points on the curve which are randomly scattered. This is very accurate model with high percision.

95% Confidence Intervals for Predictions with Error Bars (Test Data):

In [150]:
# Calculate the residual variance from the training data: 
# This estimates the inherent noise/variability in our data

n_train <- nrow(X_train)                # Number of training samples
p <- ncol(X_train)                      # Number of model parameters (features)
residuals_train <- y_train - X_train %*% theta_train  # Training residuals
rss_train <- sum(residuals_train^2)     # Residual sum of squares
sigma_squared <- rss_train / (n_train - p)  # Estimated error variance
In [151]:
# Calculate the variance of predictions for test data
# This tells us how uncertain each prediction is
# Formula: Var(y_pred) = sigma² * x_test * (X_train'X_train)⁻¹ * x_test'
# where x_test is a row of the test design matrix

# Calculate (X'X)⁻¹ once (inverse of X'X matrix)
XtX_inv <- solve(t(X_train) %*% X_train)
In [152]:
# Initialize vector to store prediction variances
prediction_vars <- numeric(nrow(X_test))

# Calculate prediction variance for each test point
for (i in 1:nrow(X_test)) {
  x_i <- X_test[i, , drop = FALSE]  # Get the i-th test point (as a row matrix)
  # Calculate variance using the formula
  prediction_vars[i] <- sigma_squared * (x_i %*% XtX_inv %*% t(x_i))
}
In [153]:
# Calculate the standard errors of the predictions
prediction_se <- sqrt(prediction_vars)  # Standard error = sqrt(variance)

# Calculate the critical t-value for 95% confidence
# We use t-distribution with (n_train - p) degrees of freedom
t_critical <- qt(0.975, n_train - p)  # 0.975 for 95% confidence (two-tailed)
In [154]:
# Calculate the confidence intervals
# CI = prediction ± t_critical * standard_error
lower_ci <- y_pred - t_critical * prediction_se  # Lower bound
upper_ci <- y_pred + t_critical * prediction_se  # Upper bound
In [155]:
# Create a data frame for plotting
# For error bars, use a subset of points to avoid overcrowding
sample_interval <- 20  # Show one error bar every 20 points
sample_indices <- seq(1, length(y_test), by = sample_interval)
In [156]:
# Full data for ribbon plot
full_data <- data.frame(
  Index = 1:length(y_test),
  Actual = y_test,
  Predicted = y_pred,
  Lower_CI = lower_ci,
  Upper_CI = upper_ci
)

# Sampled data for error bars
sampled_data <- data.frame(
  Index = sample_indices,
  Actual = y_test[sample_indices],
  Predicted = y_pred[sample_indices],
  Lower_CI = lower_ci[sample_indices],
  Upper_CI = upper_ci[sample_indices]
)
In [157]:
# Print confidence interval statistics for reference
cat("Confidence interval statistics for test data:\n")
cat("Mean width:", mean(upper_ci - lower_ci), "\n")
cat("Maximum width:", max(upper_ci - lower_ci), "\n")
cat("Minimum width:", min(upper_ci - lower_ci), "\n")
Confidence interval statistics for test data:
Mean width: 0.4207347 
Maximum width: 1.301707 
Minimum width: 0.2222644 
Mean width: 0.4207347 
Maximum width: 1.301707 
Minimum width: 0.2222644 

The confidence intervals for the test data predictions are quite narrow, indicating that the model is making predictions with a high level of certainty. On average, the width of the confidence intervals is approximately 0.42 units.

The maximum width of the confidence intervals is 1.30 units, which occurs for the most uncertain predictions, while the minimum width is 0.22 units, reflecting the least uncertain predictions. Overall, the narrow confidence intervals suggest that the model is reliable and consistent in its predictions for the test data.

In [158]:
# Create a plot with 
#   - Actual values as points
#   - Predicted values as a line
#   - Confidence interval as a ribbon
#   - Error bars for selected points

combined_plot <- ggplot() +
  # Add confidence interval ribbon for all points
  # This shows the range where we expect the true value to fall with 95% probability
  geom_ribbon(data = full_data, 
              aes(x = Index, ymin = Lower_CI, ymax = Upper_CI, 
                  fill = "95% Confidence Interval"), 
              alpha = 0.3) +
  
  # Add predicted values line for all points
  # This shows our model's best estimate for each test point
  geom_line(data = full_data, 
            aes(x = Index, y = Predicted, color = "Predicted Values"), 
            linewidth = 0.8) +
  
  # Add actual values points for all points
  # These show the true observed values we're trying to predict
  geom_point(data = full_data, 
             aes(x = Index, y = Actual, color = "Actual Values"), 
             size = 0.9, alpha = 0.5) +
  
  # Add error bars for sampled points
  # These show the 95% confidence interval for selected points
  geom_errorbar(data = sampled_data, 
                aes(x = Index, ymin = Lower_CI, ymax = Upper_CI, 
                    color = "Error Bars"),
                width = 5, linewidth = 0.7, alpha = 0.7) +
  
  # Add predicted values points for sampled points
  # These highlight the points where we're showing error bars
  geom_point(data = sampled_data, 
             aes(x = Index, y = Predicted, shape = "Predicted (sampled)"), 
             color = "black", size = 2) +
  
  # Set custom colors and shapes
  scale_color_manual(values = c("Predicted Values" = "#33e78a",
                                "Actual Values" = "#ff0000",
                                "Error Bars" = "#2323f4"),
                     name = "Data Elements") +
  scale_fill_manual(values = c("95% Confidence Interval" = "#ffc800"),
                    name = "Confidence Region") +
  scale_shape_manual(values = c("Predicted (sampled)" = 19),
                     name = "Sampled Points") +
  
  # Add informative labels and title
  labs(title = "Test Data: Predictions with 95% Confidence Intervals",
       subtitle = paste("Showing CI ribbon (all points) and error bars (every", 
                        sample_interval, "points)"),
       x = "Test Data Index",
       y = "Energy Output") +
  
  # Add theme with legend at the bottom
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 19),
    plot.subtitle = element_text(hjust = 0.5, size = 15),
    legend.position = "bottom",
    legend.box = "vertical"
  )

# Display the combined plot
print(combined_plot)
No description has been provided for this image

This plot shows the predictions made by the model on the test dataset, with their 95% confidence intervals. The green ribbon(not visible clearly due to very narrow CI) represents the confidence intervals, which indicate the range within which the true energy output is expected to fall with 95% certainty. The red dots are the actual observed energy output values, while the green line represents the predicted values from the model.

The black dots highlight sampled predicted points(ie. predicted values from our sampled_dataset), and the blue error bars (not clearly visible) show the confidence intervals for every 20th data point.

However, due to the high density of data points and the narrow confidence intervals, it’s difficult to interpret the plot clearly. The overlapping points and narrow intervals make the confidence intervals and error bars invisible.

To make error bars and CI visible, I zoomed-in the y-axis. This makes the y-axis appear bigger and makes the interval visible on the plot. Zooming y-axis may results into cutting-off the values from the screen frame but this is done only to show that intervals and bars do exist on the plot.

In [159]:
# create a zoomed version by focusing on a small y-axis range around predictions

# First, find a good y-range to zoom into
# Calculate the average prediction and a small range around it
mean_pred <- mean(y_pred)
y_range <- max(y_test) - min(y_test)
zoom_factor <- 0.05  # Show 5% of the full range
zoom_range <- y_range * zoom_factor

# Create the zoomed plot
zoomed_plot <- ggplot() +
  # Add confidence interval ribbon for all points
  geom_ribbon(data = full_data, 
              aes(x = Index, ymin = Lower_CI, ymax = Upper_CI, 
                  fill = "95% Confidence Interval"), 
              alpha = 1.0) +  # Increased alpha for better visibility
  
  # Add predicted values line
  geom_line(data = full_data, 
            aes(x = Index, y = Predicted, color = "Predicted Values"), 
            linewidth = 0.9) +  # Increased line width
  
  # Add error bars for sampled points
  geom_errorbar(data = sampled_data, 
                aes(x = Index, ymin = Lower_CI, ymax = Upper_CI, 
                    color = "Error Bars"),
                width = 5, linewidth = 1.0, alpha = 0.9) +  # Increased width and alpha
  
  # Add predicted values points for sampled points
  geom_point(data = sampled_data, 
             aes(x = Index, y = Predicted, shape = "Predicted (sampled)"), 
             color = "black", size = 2.5) +
  
  # Set custom colors and shapes
  scale_color_manual(values = c("Predicted Values" = "#33e78a",
                                "Error Bars" = "#2323f4"),
                     name = "Data Elements") +
  scale_fill_manual(values = c("95% Confidence Interval" = "#ee00ff"),
                    name = "Confidence Region") +
  scale_shape_manual(values = c("Predicted (sampled)" = 19),
                     name = "Sampled Points") +
  
  # Add informative labels and title
  labs(title = "ZOOMED VIEW: Predictions with 95% Confidence Intervals (Test Data):",
       subtitle = "(Y-axis zoomed to show confidence intervals and error bars clearly)",
       x = "Test Data Index",
       y = "Energy Output (zoomed scale)") +
  
  # Set y-axis limits to zoom in
  ylim(mean_pred - zoom_range, mean_pred + zoom_range) +
  
  # Add theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 19),
    plot.subtitle = element_text(hjust = 0.5, size = 15),
    legend.position = "bottom"
  )

# Display the zoomed plot
print(zoomed_plot)
Warning message:
"Removed 5 rows containing missing values or values outside the scale range
(`geom_line()`)."
Warning message:
"Removed 125 rows containing missing values or values outside the scale range
(`geom_point()`)."
No description has been provided for this image

Thw green line represents the predicted energy output, and the black dots highlight sampled predicted points. The blue error bars (for every 20th point) and the pink confidence interval ribbons indicate the range within which the true energy output is expected to fall with 95% certainty.

By zooming in on the y-axis, we’ve magnified the scale, making these intervals and error bars much more visible. This zoomed view helps us clearly see how the predicted values align with the actual values and how the confidence intervals and error bars provide a range of uncertainty around the predictions.

In [160]:
# Print statistics about confidence interval widths to understand why they're hard to see
cat("\nWhy confidence intervals are difficult to see:\n")
cat("Mean CI width:", mean(upper_ci - lower_ci), "\n")
cat("Maximum CI width:", max(upper_ci - lower_ci), "\n")
cat("Minimum CI width:", min(upper_ci - lower_ci), "\n")
cat("Range of y values:", max(y_test) - min(y_test), "\n")
cat("CI width as percentage of y range:", 
    100 * mean(upper_ci - lower_ci) / (max(y_test) - min(y_test)), "%\n")
Why confidence intervals are difficult to see:
Mean CI width: 0.4207347 
Maximum CI width: 1.301707 
Minimum CI width: 0.2222644 
Range of y values: 74.19 
CI width as percentage of y range: 0.5671044 %

The confidence intervals are quite narrow compared to the full range of values in the dataset. On average, the width of a confidence interval is just about 0.42 units, while the total range of y-values is 74.19. Even the widest confidence interval only spans about 1.30 units, which is still very small portion of the overall data range.

CI are so small in comparison to the full scale of the data, they look almost invisible or just like thin lines, making it difficult for viewers to notice or interpret them correctly.

To better visualize the Confidence intervals and the error bars, I will take a very small sample ie. 16 data points from the test datset. Focusing only on the small sample porson helps clearly visualize every component in the plot.

In [161]:
# Small sample of data for better visualization
# We'll select a small subset of consecutive points to show detail

# Select a small range of data points (e.g., points 100-115) ie. for 16 data points
sample_range <- 100:115
sample_data <- full_data[sample_range, ]

# Create the sample plot
sample_plot <- ggplot(sample_data, aes(x = Index)) +
  # Add confidence interval ribbon
  geom_ribbon(aes(ymin = Lower_CI, ymax = Upper_CI, 
                 fill = "95% Confidence Interval"), 
              alpha = 1.0) +
  
  # Add predicted values line
  geom_line(aes(y = Predicted, color = "Predicted Values"), 
            linewidth = 1.2) +
  
  # Add actual values points
  geom_point(aes(y = Actual, color = "Actual Values"), 
             size = 2.5) +
  
  # Add error bars for all points in this small sample
  geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI, 
                   color = "Error Bars"),
                width = 0.5, linewidth = 0.8) +
  
  # Set custom colors
  scale_color_manual(values = c("Predicted Values" = "#38d082",
                                "Actual Values" = "#8318e0",
                                "Error Bars" = "#2323f4"),
                     name = "") +
  scale_fill_manual(values = c("95% Confidence Interval" = "#ffb700"),
                    name = "") +
  
  # Add informative labels and title
  labs(title = "(Test Sample of 15 Points): 95% Confidence Intervals with Error Bars:",
       subtitle = paste("Data points", min(sample_range), "to", max(sample_range)),
       x = "Test Data Index",
       y = "Energy Output") +
  
  # Add theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 20),
    plot.subtitle = element_text(hjust = 0.5, size = 15),
    legend.position = "bottom"
  )

# Display the sample plot
print(sample_plot)
No description has been provided for this image

This plot shows predictions of energy output along with actual values and the uncertainty around those predictions (shown using 95% confidence intervals and error bars).

Even though confidence intervals are included (in orange) and error bars are clearly marked (in blue), they are very narrow compared to the full range of energy output values (y-axis). That’s why the intervals appear thin and not very noticeable—they only cover about less then 1% of the total range.

This results into only one conclusion, the model’s predictions are quite precise, and this precision makes the confidence intervals almost invisible on the plot.

Residual Plot for Testing Dataset:

Residual plot is used to visualize the difference between predicted and actual values in a regression model. Each point in the plot represents the residual (actual − predicted) on the y-axis and the corresponding predicted value on the x-axis. Ideally, residuals should be randomly scattered around the horizontal axis (residual = 0), indicating that the model captures the underlying pattern without systematic bias. These plots help detect non-linearity, unequal error variances (heteroscedasticity), and outliers that could compromise model validity.

The importance of residual plots lies in their ability to visually assess the quality and assumptions of a regression model. A well-behaved residual plot with no discernible patterns supports assumptions like linearity, independence, and homoscedasticity of errors. However, if the plot reveals a curve or funnel shape, it suggests that the model may be misfitting the data, requiring model improvement such as transforming variables, adding interaction terms, or choosing a different model type. Thus, residual plots serve as essential tools for model validation and refinement.

In [162]:
# residual plot for test data (predicted values vs. residuals): 

# hide warnings
options(warn = -1)

# Calculate residuals for test data
test_residuals <- y_test - y_pred

# Create a data frame for plotting
residual_data <- data.frame(
  Predicted = as.numeric(y_pred),
  Residual = as.numeric(test_residuals)
)

# Create the residual plot with improved legend
residual_plot <- ggplot(residual_data, aes(x = Predicted, y = Residual)) +
  # Add points for residuals with named aesthetic for legend
  geom_point(aes(color = "Residuals"), alpha = 0.7, size = 1.5) +
  
  # Add horizontal line at y = 0 with named aesthetic for legend
  geom_hline(aes(linetype = "Zero Line (Perfect Prediction)"), 
             yintercept = 0, color = "red", linewidth = 1) +
  
  # Add a smooth trend line with named aesthetic for legend
  geom_smooth(aes(color = "Trend Line", fill = "Confidence Band"),
              method = "loess", se = TRUE, linewidth = 1.2, alpha = 0.2) +
  
  # Set custom colors and linetypes
  scale_color_manual(values = c("Residuals" = "#3366CC", 
                               "Trend Line" = "#CC6633"),
                    name = "Elements") +
  scale_fill_manual(values = c("Confidence Band" = "#CC6633"),
                   name = "Elements") +
  scale_linetype_manual(values = c("Zero Line (Perfect Prediction)" = "dashed"),
                       name = "Elements") +
  
  # Add labels and title
  labs(title = "Residuals vs. Predicted Values (Test Data)",
       subtitle = "(A key diagnostic plot for assessing model assumptions)",
       x = "Predicted Energy Output",
       y = "Residual (Actual - Predicted)") +
  
  # Add theme with improved legend
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 19),
    plot.subtitle = element_text(hjust = 0.5, size = 16),
    legend.position = "bottom",
    legend.box = "vertical",
    legend.title = element_text(face = "bold"),
    legend.text = element_text(size = 10),
    legend.background = element_rect(fill = "white", color = "gray90"),
    legend.margin = margin(6, 6, 6, 6)
  ) +
  
  # Combine all legend elements into one
  guides(color = guide_legend(order = 1, override.aes = list(
                             linetype = c("solid", "solid"),
                             shape = c(16, NA))),
         fill = guide_legend(order = 2),
         linetype = guide_legend(order = 3))

# Display the plot
print(residual_plot)

# Reset warnings to default
options(warn= 0)
`geom_smooth()` using formula = 'y ~ x'
No description has been provided for this image

The plot shows residuals (actual − predicted values) against predicted energy output for a test dataset. Most residuals are centered around the zero line, which is a good sign that the model predictions are generally unbiased. The presence of a red horizontal reference line at 0 helps highlight how residuals deviate. The blue dots (residuals) appear fairly spread out but don't show extreme deviations from the zero line across most predicted values, suggesting overall decent performance.


Task 3: Approximate Bayesian Computation:¶

Our choosen model is Model-5:

$y = θ₁ * x4 + θ₂ * x1² + θ₃ * x3² + θ_bias$

From Task 2.1, the optimized (least squares) parameters we obtained are:

In [163]:
model_parameters$model5
A matrix: 4 × 1 of type dbl
bias-754.486865
x4 186.436251
x1_squared -4.728874
x3_squared -2.599210
  • θ_bias = -754.486865
  • θ₁ (x4) = 186.436251
  • θ₂ (x1²) = -4.728874
  • θ₃ (x3²) = -2.599210

We will use ABC to compute posterior distributions for the 2 parameters with the largest absolute value.

So: θ₁ = 186.436251 and, θ₂ = -4.728874 have the highest magnitudes. So we’ll focus on estimating the posterior for:

  • θ₁ (coefficient of x4)
  • θ₂ (coefficient of x1²)

We will keep θ₃ and θ_bias fixed at their least squares estimates.

Why not take bias in posterior estimation (this has the higest absolute value):

  • The bias term (intercept) does not represent a physical effect of any input variable; it just shifts the regression line up or down.

  • In ABC/posterior analysis, we are interested in the uncertainty of parameters that represent real-world relationships (i.e., coefficients of predictors), not the baseline offset.

  • The bias is often well-determined by the data and less informative for scientific interpretation.

  • The assignment specifically asks for the two parameters with the largest absolute value (excluding the bias, which is standard practice in parameter inference).

  • Varying the bias alone cannot compensate for changes in the relationship between predictors and output; it only adjusts the mean.

  • Fixing the bias simplifies the ABC computation and focuses the analysis on the parameters of interest (θ₁ and θ₂).

Now, define Uniform priors around the least squares estimates. A good simple approach is $±20%$ of the parameter value.

  • θ₁ ∈ [186.436251 * 0.8, 186.436251 * 1.2]
  • θ₂ ∈ [-4.728874 * 1.2, -4.728874 * 0.8]
In [164]:
# Load necessary libraries
library(ggplot2)
library(reshape2)
library(scales)

# Prepare data for simulation: 
x1_squared <- data$x1^2
x3_squared <- data$x3^2
x4 <- data$x4
y_obs <- data$y
Attaching package: 'scales'


The following object is masked from 'package:viridis':

    viridis_pal


The following object is masked from 'package:purrr':

    discard


The following object is masked from 'package:readr':

    col_factor


The following object is masked from 'package:viridis':

    viridis_pal


The following object is masked from 'package:purrr':

    discard


The following object is masked from 'package:readr':

    col_factor


In [165]:
# Fixed parameters theta_3 and bias: 
theta3_fixed <- -2.599210
theta_bias_fixed <- -754.486865
In [166]:
# Least squares estimates; this is used for reference
theta1_mean <- 186.436251
theta2_mean <- -4.728874

# Define prior ranges (±20% of mean)
theta1_range <- c(theta1_mean * 0.8, theta1_mean * 1.2)
theta2_range <- c(theta2_mean * 1.2, theta2_mean * 0.8)  # flipped for negative mean
In [167]:
# ABC Rejection Sampling Parameters
n_samples <- 10000
#epsilon <- 1e6 # Acceptance threshold
epsilon <- 10^6.62 # Acceptance threshold
# Storage for all samples and distances
theta1_all <- numeric(n_samples)
theta2_all <- numeric(n_samples)
distance <- numeric(n_samples)
epsilon # Lets see the value
4168693.83470336
In [168]:
# RSS threshold (acceptance)
#RSS_Model_5 <- sum((y_obs - (theta1_mean * x4 + theta2_mean * x1_squared + theta3_fixed * x3_squared + theta_bias_fixed))^2)
#epsilon <- RSS_Model_5 * 2
#epsilon
In [169]:
# random seed for reproducability: 
set.seed(0)

# Run ABC rejection sampling
for (i in 1:n_samples) {
  # Sample from priors: 
  theta1 <- runif(1, theta1_range[1], theta1_range[2])
  theta2 <- runif(1, theta2_range[1], theta2_range[2])
  
  # Store the sampled values
  theta1_all[i] <- theta1
  theta2_all[i] <- theta2
  
  # Simulate y using the model
  y_sim <- theta1 * x4 + theta2 * x1_squared + theta3_fixed * x3_squared + theta_bias_fixed
  
  # Calculate distance
  distance[i] <- sum((y_obs - y_sim)^2)
}

# Get accepted samples
accepted_indices <- which(distance < epsilon)
posterior_df <- data.frame(
  theta1 = theta1_all[accepted_indices],
  theta2 = theta2_all[accepted_indices]
)

# Also create full dataframe for accepted/rejected visualization
status <- ifelse(distance < epsilon, "Accepted", "Rejected")
full_sample_df <- data.frame(
  theta1 = theta1_all,
  theta2 = theta2_all,
  status = status
)
In [170]:
head(posterior_df)
A data.frame: 6 × 2
theta1theta2
<dbl><dbl>
1186.2647-4.317238
2184.7382-4.289441
3187.8259-4.422433
4184.8044-3.926716
5186.1076-4.758478
6187.0061-4.383121
In [171]:
# posterior distribution of the parameter theta_1: 

ggplot(posterior_df, aes(x = theta1)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "lightgray", color = "white", alpha = 0.6) +
  geom_density(fill = "skyblue", alpha = 0.6) +
  geom_rug(sides = "b", color = "black", alpha = 0.3) +
  geom_vline(xintercept = theta1_mean, color = "red", linetype = "dashed", size = 1) +
  annotate(
    "text", 
    x = theta1_mean, 
    y = Inf, 
    label = "LS Estimate", 
    vjust = 2, 
    color = "red", 
    fontface = "italic",
    size = 5
  ) +
  labs(
    title = "Posterior Distribution of theta_1",
    x = expression(theta[1]),
    y = "Density"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    axis.title.x = element_text(size = 16),
    axis.title.y = element_text(size = 16),
    axis.text = element_text(size = 12)
  )
No description has been provided for this image

This plot shows the posterior distribution of the parameter $\theta_1$ (the coefficient for $x_4$, exhaust vacuum) obtained via Approximate Bayesian Computation (ABC). The distribution is approximately symmetric and bell-shaped, centered around the least squares (LS) estimate. The red dashed vertical line marks the LS estimate of $\theta_1$, labeled "LS Estimate." The posterior is centered close to this value, indicating agreement between Bayesian and frequentist estimates.

The width of the density curve reflects the uncertainty in $\theta_1$. Most of the posterior mass is concentrated in a narrow range, suggesting high confidence in the parameter estimate. The tick marks along the x-axis show individual accepted samples from the posterior, indicating good coverage and sample density.

The ABC posterior supports the LS estimate, with little evidence for values far from the LS solution. The parameter $\theta_1$ is well-identified by the data.

In [172]:
# approximate Bayesian posterior for theta_2: 

ggplot(posterior_df, aes(x = theta2)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "lightgray", color = "white", alpha = 0.6) +
  geom_density(fill = "lightgreen", alpha = 0.6) +
  geom_rug(sides = "b", color = "black", alpha = 0.3) +
  geom_vline(xintercept = theta2_mean, color = "red", linetype = "dashed", size = 1) +
  annotate(
    "text", 
    x = theta2_mean, 
    y = Inf, 
    label = "LS Estimate", 
    vjust = 2, 
    color = "red", 
    fontface = "italic",
    size = 5
  ) +
  labs(
    title = "Posterior Distribution of theta_2",  # θ₂ (x²¹)
    x = expression(theta[2]),
    y = "Density"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(
      size = 20,
      face = "bold",
      hjust = 0.5
    ),
    axis.title.x = element_text(size = 16),
    axis.title.y = element_text(size = 16),
    axis.text = element_text(size = 12)
  )
No description has been provided for this image

This plot shows the approximate Bayesian posterior for $\theta_2$, the coefficient of x₁² (squared temperature) in the regression model. The distribution is nearly flat (uniform-like) within the accepted range, indicating high uncertainty or weak identifiability for $\theta_2$. The red dashed vertical line marks the least squares (LS) estimate for $\theta_2$. The posterior is spread widely around this value. The wide, flat posterior suggests the observed data do not strongly constrain $\theta_2$ —many values are plausible. The tick marks along the x-axis show accepted $\theta_2$ samples, confirming good coverage across the range.

Unlike $\theta_1$, the data provide little information to precisely estimate $\theta_2$. The LS estimate is plausible, but many other values are also supported by the data.

In [173]:
# Plot joint posterior of accepted samples with styled theme and centered big title
ggplot(posterior_df, aes(x = theta1, y = theta2)) +
  geom_point(alpha = 0.4, color = "#9906f5") +
  labs(
    title = "Joint Posterior Distribution of theta_1 and theta_2 (Accepted Only) :",
    x = expression(theta[1]),
    y = expression(theta[2])
  ) +
  theme_minimal(base_size = 20) +
  theme(
    plot.title = element_text(
      size = 20,
      face = "bold",
      hjust = 0.5
    ),
    axis.title = element_text(size = 19)
  )
No description has been provided for this image

The plot shows the joint posterior distribution of $\theta_1$ (coefficient for x₄, exhaust vacuum) and $\theta_2$ (coefficient for x₁², squared temperature) using accepted samples from Approximate Bayesian Computation (ABC). The accepted samples are spread across a rectangular region, indicating that a wide range of $\theta_1$ and $\theta_2$ values are plausible given the data and the ABC threshold. There is no strong visible correlation or clear pattern between $\theta_1$ and $\theta_2$; the samples are scattered rather than forming a diagonal or curved band.

$\theta_1$ is relatively well-constrained (narrower horizontal spread), while $\theta_2$ shows much more uncertainty (wider vertical spread), consistent with previous marginal posterior plots. The data strongly informs $\theta_1$, but provides weak information about $\theta_2$. Many combinations of $\theta_1$ and $\theta_2$ fit the data nearly equally well, especially for $\theta_2$.

The model is sensitive to $\theta_1$ (exhaust vacuum effect), but the effect of $\theta_2$ (squared temperature) is less identifiable from the data. There is no evidence of a trade-off or dependency between these two parameters in the accepted posterior samples.

In [174]:
# Plot joint posterior including accepted/rejected
ggplot(full_sample_df, aes(x = theta1, y = theta2, color = status)) +
  geom_point(alpha = 0.5, size = 2.5) +
  scale_color_manual(values = c("Accepted" = "blue", "Rejected" = "#c39090")) +
  geom_vline(xintercept = theta1_mean, color = "red", linetype = "dashed", size = 1) +
  geom_hline(yintercept = theta2_mean, color = "red", linetype = "dashed", size = 1) +
  annotate("point", x = theta1_mean, y = theta2_mean, color = "red", shape = 4, size = 4, stroke = 1.5) +
  annotate("text", x = theta1_mean, y = theta2_mean + 0.2, label = "LS Estimate", color = "red", fontface = "italic") +
  labs(
    title = "Joint Posterior of theta_1 and theta_2 (Accepted and Rejected):",
    x = expression(theta[1] ~ "(x4)"),
    y = expression(theta[2] ~ "(x1²)"),
    color = "Sample Status"
  ) +
  theme_minimal(base_size = 20) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 18),
    axis.title = element_text(face = "bold"),
    legend.title = element_text(face = "bold")
  )
No description has been provided for this image

This plot shows the joint posterior distribution of two model parameters, $\theta_1$ (exhaust vacuum coefficient) and $\theta_2$ (squared temperature coefficient), using Approximate Bayesian Computation (ABC). Each point represents a sampled parameter pair: blue points represents the accepted samples (fit the data well, i.e., posterior samples) and brown points represents the ejected samples (do not fit the data well).

From plot, we clearly see that $\theta_1$ (x₄) is well-constrained, which means, the accepted samples (blue) form a narrow vertical band, indicating that only a limited range of $\theta_1$ values produce good model fits. This means the data provides strong information about $\theta_1$.

And, $\theta_2$ (x₁²) is weakly identified ie. the accepted samples are spread widely along the $\theta_2$ axis, showing that many $\theta_2$ values are plausible. The data does not strongly constrain $\theta_2$.

The blue region (blue region) is rectangular, not diagonal, indicating little to no identifiable dependency between $\theta_1$ and $\theta_2$ in the accepted region.

The red dashed lines and marker show the least squares (LS) estimates for $\theta_1$ and $\theta_2$, which lie within the accepted region, confirming consistency between Bayesian and frequentist estimates.

Vary $bias$ and $\theta_3$ and Keep $\theta_1$ and $\theta_2$ fixed in ABC:¶

In [175]:
# Varying bias and theta_3, keeping theta_1 and theta_2 fixed: 

# 1. Prepare data and fixed parameters: 
x1_squared <- data$x1^2
x3_squared <- data$x3^2
x4 <- data$x4
y_obs <- data$y

# Fixed parameters (from LS estimates): 
theta1_fixed <- 186.436251
theta2_fixed <- -4.72887
In [176]:
# Vary bias and theta_3 (±20% around LS estimates): 
theta3_mean <- -2.599210
bias_mean <- -754.486865
theta3_range <- sort(c(theta3_mean * 1.2, theta3_mean * 0.8))
bias_range   <- sort(c(bias_mean * 1.2, bias_mean * 0.8))
In [177]:
# ABC test settings: 
n_samples <- 10000
epsilon <- 10^6.62 # Same threshold as before

# stored samples and distances: 
theta3_all <- numeric(n_samples)
bias_all <- numeric(n_samples)
distance <- numeric(n_samples)

set.seed(0) # random seed for reproducibility
# Run ABC rejection sampling for theta_3 and bias
for (i in 1:n_samples) {
  # Sample from priors
  theta3 <- runif(1, theta3_range[1], theta3_range[2])
  bias <- runif(1, bias_range[1], bias_range[2])
  theta3_all[i] <- theta3
  bias_all[i] <- bias
  # Simulate y
  y_sim <- theta1_fixed * x4 + theta2_fixed * x1_squared + theta3 * x3_squared + bias
  # Calculate distance (RSS)
  distance[i] <- sum((y_obs - y_sim)^2)
}

# Accepted samples: 
accepted_indices <- which(distance < epsilon)
posterior_df <- data.frame(
  theta3 = theta3_all[accepted_indices],
  bias = bias_all[accepted_indices]
)
status <- ifelse(distance < epsilon, "Accepted", "Rejected")
full_sample_df <- data.frame(
  theta3 = theta3_all,
  bias = bias_all,
  status = status
)

# LS estimates for reference
theta3_ls <- theta3_mean
bias_ls <- bias_mean
In [178]:
# Joint posterior (accepted only): 
ggplot(posterior_df, aes(x = theta3, y = bias)) +
  geom_point(alpha = 0.4, color = "#1f77b4") +
  geom_vline(xintercept = theta3_ls, color = "red", linetype = "dashed", size = 1) +
  geom_hline(yintercept = bias_ls, color = "red", linetype = "dashed", size = 1) +
  annotate("point", x = theta3_ls, y = bias_ls, color = "red", shape = 4, size = 4, stroke = 1.5) +
  annotate("text", x = theta3_ls, y = bias_ls + 10, label = "LS Estimate", color = "red", fontface = "italic") +
  labs(
    title = "Joint Posterior of bias and theta_3 (Accepted Only):",
    x = expression(theta[3] ~ "(x3²)"),
    y = "bias"
  ) +
  theme_minimal(base_size = 18) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 18),
    axis.title = element_text(face = "bold")
  )
No description has been provided for this image

There is no clear pattern or relationship between $bias$ and $\theta_3$; accepted values are scattered in a rectangle. The LS ($bias$ ≈ -754.49, $\theta_3$ ≈ -2.60)estimates (red lines) are in the center of the accepted region, showing that the LS solution is supported by the Bayesian approach.

Many combinations of $bias$ and $\theta_3$ fit the data well, and the model is not very sensitive to small changes in these parameters within the accepted range.

In [179]:
# Joint posterior (accepted and rejected)
ggplot(full_sample_df, aes(x = theta3, y = bias, color = status)) +
  geom_point(alpha = 0.5, size = 2.2) +
  scale_color_manual(values = c("Accepted" = "blue", "Rejected" = "#c39090")) +
  geom_vline(xintercept = theta3_ls, color = "red", linetype = "dashed", size = 1) +
  geom_hline(yintercept = bias_ls, color = "red", linetype = "dashed", size = 1) +
  annotate("point", x = theta3_ls, y = bias_ls, color = "red", shape = 4, size = 4, stroke = 1.5) +
  annotate("text", x = theta3_ls, y = bias_ls + 10, label = "LS Estimate", color = "red", fontface = "italic") +
  labs(
    title = "Joint Posterior of bias and theta_3 (Accepted + Rejected):",
    x = expression(theta[3] ~ "(x3²)"),
    y = "bias",
    color = "Sample Status"
  ) +
  theme_minimal(base_size = 18) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 18),
    axis.title = element_text(face = "bold"),
    legend.title = element_text(face = "bold")
  )
No description has been provided for this image

Accepted samples (blues) are tightly clustered horizontally from about $\theta_3$ = $-3.12$ to $-2.08$ and vertically from about $bias$ = -905 to -603. Rejected samples (brown) fill the rest of the rectangle, showing that most combinations do not fit the data well.
There is no visible relationship between $bias$ and $\theta_3$; accepted values form a horizontal band. The LS estimates are in the center of the accepted region, confirming the LS solution is supported by the Bayesian approach. Only a narrow range of $bias$ values (around $-754.49$) with a wide range of $\theta_3$ values fit the data well, and the model is not sensitive to $theta_3$ within this accepted range.